Preface


Machine Learning: Modeling Data Locally and Globally presents the main
contemporary themes and tools in machine learning, including probabilistic
generative models and Support Vector Machines. These themes are discussed
or reformulated from either a local view or a global view. Unlike previous
books, which investigate machine learning algorithms only locally or only
globally, this book presents a new, unified picture of machine learning
that is both local and global. Within this picture, various seemingly
different machine learning models and theories are bridged in an elegant
and systematic manner. To support a precise and thorough understanding,
this book also presents applications of the new hybrid theory.

This book not only provides researchers with the latest research results
in a timely manner, but also presents an excellent overview of machine
learning. Importantly, the theme of learning both locally and globally runs
through the whole book and makes various learning models understandable to
a broad audience, including researchers in machine learning, practitioners
in pattern recognition, and graduate students.


Kaizhu Huang 
Haiqin Yang 
Irwin King 
Michael R. Lyu 


The Chinese Univ. of Hong Kong, 
Jan. 2008 
Introduction


The objective of this book is to establish a framework that combines two
different paradigms in machine learning: global learning and local learning.
The combined model demonstrates that a hybrid of these two schools of
approaches can outperform each approach in isolation, both theoretically
and empirically. Global learning focuses on describing a phenomenon or
modeling data in a global way. For example, a distribution over the
variables is usually estimated to summarize the data, and its output can
usually reconstruct the data. This school of approaches, which includes
Bayesian Networks [8, 13, 30], Gaussian Mixture Models [3, 21], and Hidden
Markov Models [2, 25], has a long and distinguished history and has been
extensively applied in artificial intelligence [26], pattern recognition [9],
and computer vision [7]. Local learning, on the other hand, does not intend
to summarize a phenomenon, but builds learning systems by concentrating on
some local parts of the data. It lacks the flexibility of global learning,
yet it demonstrates surprisingly superior performance according to recent
research [4, 16, 15]. In this book, a bridge is established between these
two paradigms. Moreover, the resulting principled framework subsumes several
important models that belong, respectively, to the global learning paradigm
and the local learning paradigm.

In this chapter, we discuss the motivations behind the two learning
frameworks. We then present the objectives of this book and outline the
main models and contributions. Finally, we provide an overview of the rest
of this book.


1.1 Learning and Global Modeling 

When studying real-world phenomena, scientists often wonder whether
underlying laws or elegant mathematical formulae govern these complex
phenomena. Moreover, in practice, because information is incomplete, the
phenomena are usually nondeterministic. This motivates the use of
probabilistic or statistical models to perform a global investigation of
data sampled from the phenomena. A common way to achieve this goal is to fit
a density to the observed data. With the learned density, one can then
incorporate prior knowledge, make predictions, and perform inference and
marginalization. One main category within global learning is so-called
generative learning. By assuming a specific mathematical model for the
observations, e.g. a Gaussian distribution, the phenomena can be described
or re-generated. Fig. 1.1 illustrates such an example. In this figure, two
classes of data are plotted, as *’s for the first class and o’s for the
other class. The data can be modeled as two different mixtures of Gaussian
distributions, as illustrated in Fig. 1.2. Knowing only the parameters of
these distributions, one can summarize the phenomena. Furthermore, one can
employ this information to distinguish one class of data from the other,
i.e. to separate the two classes. This is known as the Bayes optimal
decision problem [12, 6].
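
As a minimal sketch of this generative route, the following Python fragment
fits one Gaussian per class and classifies a point by the larger value of
prior times class-conditional density. It rests on simplifying assumptions
that are not taken from the text: a single Gaussian per class rather than a
mixture, a diagonal covariance, equal priors, and made-up toy data. The
function names are hypothetical.

```python
import math

def fit_gaussian(points):
    """Per-dimension mean and variance (diagonal covariance is an assumption)."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    var = [sum((p[d] - mean[d]) ** 2 for p in points) / n for d in range(dim)]
    return mean, var

def log_density(x, mean, var):
    """Log of a diagonal-covariance Gaussian density evaluated at x."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, mean, var))

def classify(x, models, priors):
    """Return the class with the largest log prior + log class-conditional density."""
    scores = {c: math.log(priors[c]) + log_density(x, *models[c]) for c in models}
    return max(scores, key=scores.get)

# Toy two-class, two-dimensional data (made up for illustration).
star = [(1.0, 1.2), (1.5, 0.8), (0.8, 1.9), (1.2, 1.4)]
circle = [(4.0, 4.1), (4.6, 3.7), (3.9, 4.8), (4.3, 4.4)]
models = {"star": fit_gaussian(star), "circle": fit_gaussian(circle)}
priors = {"star": 0.5, "circle": 0.5}

print(classify((1.1, 1.3), models, priors))
print(classify((4.2, 4.2), models, priors))
```

Only the fitted means and variances are needed at prediction time, which is
the sense in which the parameters summarize the data.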



Fig. 1.1. Two classes of two-dimensional data 


Fig. 1.2. An illustration of distribution-based classification (also known
as Bayes optimal decision theory). Two Gaussian mixtures are used to model
the distributions of the two classes of data, respectively. The
distributions can then be used to construct the decision plane


In the development of learning approaches within the machine learning
community, there has been a migration from the early rule-based methods
[11, 32], which require more involvement of domain experts, to widely used
probabilistic global models driven mainly by the data itself
[5, 9, 14, 17, 22, 33]. However, one question for most probabilistic global
models is what kind of global model, or more specifically, which type of
density, should be specified beforehand to summarize the phenomena. For some
tasks, this can be prescribed by introducing a small amount of domain
knowledge from experts. Unfortunately, due to both the increasing
sophistication of real-world learning tasks and active interactions among
different subjects of research, it is more and more difficult to obtain fast
and valuable suggestions from experts. A further question thus arises: what
is the next step for the machine learning community, after the migration
from rule-based models to probabilistic global models? Recent progress in
machine learning seems to suggest local learning as a solution.


1.2 Learning and Local Modeling 

Global modeling aims to describe phenomena, regardless of whether the
information summarized from the observations is applicable to a specific
task. Moreover, the hidden assumption behind global learning is that
information can be accurately extracted from data. Local learning
[10, 27, 28], on the other hand, which has recently attracted active
attention in the machine learning community, usually regards general and
accurate global learning as an impossible mission. Therefore, local learning
focuses on capturing only local yet useful information from data.
Furthermore, recent research progress and empirical studies demonstrate that
this quite different learning paradigm is superior to global learning in
many respects.

In more detail, instead of modeling data globally, local learning is more
task-oriented. It does not aim to estimate a density from data as in global
learning, which is usually an intermediate step for many tasks such as
pattern recognition (note that the distribution or density obtained by
global learning is not directly related to the classification itself); nor
does it intend to build an accurate model that fits the observations
globally. Instead, it extracts only useful information from data and
directly optimizes the learning goal. For example, when learning classifiers
from data, only the observations around the separating plane need to be
modeled accurately, while inaccurate modeling of other data is acceptable
for the purpose of classification. Fig. 1.3 illustrates this point. In this
figure, the decision boundary is constructed based only on the filled
points, while the other points make no contribution to the classification
plane (the decision boundary is obtained by the Gabriel Graph method
[1, 18, 34]).
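
To make the idea of "only some local points matter" concrete, the sketch
below finds the points joined to an opposite-class point by a Gabriel edge.
Two points share a Gabriel edge when no third point lies inside the ball
whose diameter is the segment between them. This is only an illustration of
how the filled points could be identified; the function name, the toy data,
and the choice to tolerate points lying exactly on the ball's surface are
assumptions, and the full classifier of [1, 18, 34] may differ in detail.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def gabriel_boundary_points(points, labels):
    """Indices of points joined to an opposite-class point by a Gabriel edge."""
    boundary = set()
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                continue  # only edges that cross the two classes matter
            mid = [(a + b) / 2 for a, b in zip(points[i], points[j])]
            radius_sq = sq_dist(points[i], points[j]) / 4
            # Gabriel condition: no third point strictly inside the ball.
            if all(sq_dist(points[k], mid) >= radius_sq
                   for k in range(n) if k not in (i, j)):
                boundary.update((i, j))
    return sorted(boundary)

# Toy data on a line: two points per class (made up for illustration).
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
labels = [0, 0, 1, 1]
boundary = gabriel_boundary_points(points, labels)
print(boundary)
```

In this toy example only the two innermost points are selected; moving the
outer points further away would not change the result.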



Fig. 1.3. An illustration of local learning (also known as the Gabriel 
Graph classification). The decision boundary is just determined by 
some local points indicated as filled points 

However, although it shows promising performance, local learning appears to
sit at the opposite extreme from global learning. Employing only local
information may lose the global view of the data. Consequently, it sometimes
cannot grasp the data trend, which is critical for guaranteeing good
performance on future data. This can be seen in the example illustrated in
Fig. 1.4. In this figure, the decision boundary (also constructed by
Gabriel Graph classification) is still determined by some local points,
indicated as filled points. Clearly, this boundary does not grasp the data
trend.



Fig. 1.4. An illustration that local learning cannot grasp the data trend.
The decision boundary (constructed by Gabriel Graph classification)
is determined by some local points, indicated as filled points. It, however,
loses the data trend. The decision plane should obviously be closer to the
filled squares rather than lying midway between the filled squares and o’s

More specifically, the class associated with o’s is obviously more scattered
than the class associated with squares along the axis indicated by the
dashed line. Therefore, a more promising decision boundary should lie closer
to the filled squares than to the filled o’s, instead of lying midway
between the filled points. A similar example can also be seen in Chapter 2
for a more principled local learning model, i.e. the current
state-of-the-art classifier, the Support Vector Machine (SVM) [31]. To
address this problem, we propose hybrid learning in this book.


1.3 Hybrid Learning 

Local learning and global learning have complementary advantages. Global
learning summarizes data and provides practitioners with knowledge of the
structure, independence, and trend of the data: with precise modeling of the
phenomena, the observations can be accurately regenerated and therefore
studied or analyzed thoroughly. However, this also raises the difficulty of
choosing a valid model to describe all the information (the problem of
model selection). In comparison, local learning directly employs only the
part of the information that is critical for the specific task, and does not
assume a model to re-synthesize or restore the whole road-map of the data.
Although it has been demonstrated to be superior to global learning in many
facets of machine learning, it may lose some important global information.
The question is thus: can reliable global information, independent of
specific model assumptions, be combined into local learning? This question
motivates a hybrid of the two largely different schools of approaches,
which is the focus of this book.


1.4 Major Contributions 

In this book, we describe a hybrid learning scheme that combines two
different paradigms, namely global learning and local learning. Within this
scheme, we propose a hybrid model, named the Maxi-Min Margin Machine (M^4),
which is demonstrated to have both the merits of global learning in
representing data and the advantages of local learning in handling tasks
directly and effectively. Moreover, adopting the viewpoint of local
learning, we also introduce a global learning model, called the Minimum
Error Minimax Probability Machine (MEMPM), which does not assume specific
distributions for the data and thus distinguishes itself from traditional
global learning approaches. The main models discussed in this book are
briefly described as follows.

• The Maxi-Min Margin Machine model, a hybrid learning framework that
  successfully combines global learning and local learning

  o A unified framework of many important models

    As will be demonstrated, our proposed hybrid model successfully unifies
    both important models in local learning, e.g. the Support Vector
    Machine [4], and significant models in global learning, such as the
    Minimax Probability Machine (MPM) [19] and Fisher Discriminant
    Analysis (FDA) [9].

  o A generalization guarantee

    Arguments from several viewpoints, such as sparsity and the Marshall
    and Olkin theory [20, 23], will be presented to provide the
    generalization bound for the combined approach.

  o A sequential Conic Programming solving method

    Besides the theoretical advantages of the proposed hybrid learning, we
    also tailor a sequential Conic Programming method [24, 29] to solve
    the corresponding optimization problem. The computational cost is
    shown to be polynomial, and thus the proposed M^4 model can be solved
    in practice.

• The Minimum Error Minimax Probability Machine, a general global
  learning model

  o A worst-case distribution-free Bayes optimal classifier

    Unlike traditional Bayes optimal classifiers, MEMPM does not assume
    distributions for the data. Starting from the Marshall and Olkin
    theory, this model attempts to model data under a minimax scheme. It
    does not intend to extract exact information but rather worst-case
    information from data, and thus represents important progress in
    global learning.

  o An explicit error bound for future data

    Inheriting the advantages of global learning, the proposed general
    global learning method provides an explicit worst-case error bound
    for future data under a mild condition. Moreover, the experimental
    results suggest that this bound is reliable and accurate.

  o A sequential Fractional Programming optimization

    We propose a Fractional Programming optimization method for the MEMPM
    model. In each iteration, the optimization is shown to be a
    pseudo-concave problem, which guarantees that each local solution is
    the global solution in this step.

• The Biased Minimax Probability Machine (BMPM), a global learning
  method for biased or imbalanced learning

  o A rigorous and systematic treatment of biased learning tasks

    Although it is a special case of our proposed general global learning
    model, MEMPM, this model provides a quantitative and rigorous
    approach for biased learning tasks, where one class of data is always
    more important than the other. Importantly, by explicitly controlling
    the accuracy of one class, this branch model can precisely impose a
    bias toward the important class.

  o Explicit generalization bounds for both classes of data

    Inheriting this good feature of the MEMPM model, this model also
    provides explicit generalization bounds for both classes of data,
    which guarantee a good prediction accuracy for future data.

• The Local Support Vector Regression (LSVR), a novel regression model

  o A systematic and automatic treatment for adapting margins

    Motivated by M^4, LSVR focuses on setting the margin locally. Compared
    with the regression model of SVM, i.e. Support Vector Regression
    (SVR), this novel regression model is shown to be more robust to
    noise in the data because it uses a volatile margin setting.

  o Special cases very similar to the standard SVR

    When a consistent trend is assumed for all data points, LSVR yields
    special cases very similar to the standard SVR. We further
    demonstrate that, under a meaningful assumption, the standard SVR is
    actually a special case of our LSVR model.

• Support Vector Regression with Local Margin Variations

  Motivated by the local view of data, another variation of SVR is
  proposed. It aims to adapt the margin in a more explicit way. This model
  is similar to LSVR in the sense that both adapt the margin locally.

We describe the relationship among our developed models in Fig. 1.5. 



A: Local Learning
B: Global Learning

C: Minimum Error Minimax Probability Machine
D: Biased Minimax Probability Machine
E: Maxi-Min Margin Machine
F: Local Support Vector Regression
G: Support Vector Regression with Margin Variations


Fig. 1.5. The relationship among the developed models in this book 
1.5 Scope

This book first takes learning to mean statistical learning, which appears
to be the current main trend among learning approaches. We then further
restrict learning to the framework of classification, one of the main
problems in machine learning. The corresponding discussions of the different
models, including the analysis of the computational and statistical aspects
of machine learning, are all conducted for classification tasks.
Nevertheless, we will also extend the content of this book to regression
problems, although regression is not the focus of this book.


1.6 Book Organization 

The rest of this book is organized as follows: 

• Chapter 2 

We will review different learning paradigms in this chapter. We will
establish a hierarchy graph that attempts to categorize various models
within the frameworks of local learning and global learning. We will then
use this graph to describe and discuss these models. Finally, we motivate
the Minimum Error Minimax Probability Machine and the Maxi-Min Margin
Machine.

• Chapter 3

We will develop a novel global learning model, called the Minimum Error
Minimax Probability Machine. We will demonstrate how this new model
represents the worst-case Bayes optimal classifier. We will detail its model
definition, provide interpretations, establish a robust version, extend it
to nonlinear classification, and present a series of experiments to
demonstrate the advantages of this model.

• Chapter 4 

We will present the Maxi-Min Margin Machine, which successfully combines
two different but complementary learning paradigms, i.e. local learning and
global learning. We will show how this model incorporates the Support
Vector Machine, the Minimax Probability Machine, and Fisher Discriminant
Analysis as special cases. We will also demonstrate the advantages of the
Maxi-Min Margin Machine through theoretical, geometrical, and empirical
investigations.

• Chapter 5 

An extension of the proposed MEMPM model will be discussed in this
chapter. More specifically, the Biased Minimax Probability Machine (BMPM)
will be discussed and applied to imbalanced learning tasks. We will review
different criteria for evaluating imbalanced learning approaches. We will
then use these criteria to tailor BMPM to this type of learning. Both
illustrations on toy datasets and evaluations on real-world imbalanced and
medical datasets will be provided in this chapter.

• Chapter 6

A novel regression model called the Local Support Vector Regression,
which can be regarded as an extension of the Maxi-Min Margin Machine, will
be introduced in detail in this chapter. We will show that our model can
vary the tube (margin) systematically and automatically according to the
local data trend. We will show that this novel regression model is more
robust to noise in the data. Empirical evaluations on both synthetic data
and real financial time series data will be presented to demonstrate the
merits of our model with respect to the standard Support Vector Regression.

• Chapter 7 

In this chapter, we show how to adapt the margin settings locally for
Support Vector Regression in a way that differs from LSVR. We demonstrate
how the local view of data can be widely used in various models, or even
applied differently within the same model. Empirical evaluations are also
presented in comparison with other competitive models on financial data.

• Chapter 8 

We will then summarize this book and conduct discussions on future 
work. 

We try to make each of these chapters self-contained. Therefore, in several
chapters, some critical content that has appeared in previous chapters,
e.g. model definitions or illustrative figures, may be briefly reiterated.
Global Learning vs. Local Learning


In this chapter, we conduct a more detailed and more formal review of the
two schools of learning approaches, namely global learning and local
learning. We first provide a hierarchy graph, illustrated in Fig. 2.1, in
which we try to classify many statistical models into their proper
categories, either global learning or local learning. Our review follows
this hierarchy. For clarity, we use filled shapes to highlight our own work
in the graph.

Global learning fits a distribution over data. If a specific mathematical
model, e.g. a Gaussian model, is assumed for the distribution, this is often
called generative learning; the name implies that the mathematical
formulation of the assumed model governs the generation of data in the
learning task. Several schemes have been proposed to learn the parameters of
the specific model from the observations. These include Maximum Likelihood
(ML) learning, which is easy to conduct but less accurate; Conditional
Likelihood (CL) learning, which is usually hard to optimize but more
effective; and Bayesian Average (BA) learning, which has a comparatively
short history but is more promising. Because generative learning pre-assigns
a specific model before learning, it often lacks generality and thus may be
invalid in many cases. This motivates non-parametric learning, which still
estimates a distribution over data but assumes no specific mathematical
generative model. The common approach in this type of learning is to fit a
simple density locally over each observation and then sum all the local
densities to obtain the final distribution for the data. Although this
approach is successful in some circumstances, it is criticized for requiring
a huge quantity of training points and a large space complexity. In
contrast, in this book we will present a novel global learning method, named
the Minimum Error Minimax Probability Machine (MEMPM). Although it is still
within the framework of global learning, it does not belong to
non-parametric learning and therefore requires no extremely heavy storage.
Moreover, it does not assume any specific distribution for the data, which
distinguishes it from traditional global generative learning. As a critical
contribution, MEMPM represents a distribution-free Bayes optimal classifier
in a worst-case scenario. Furthermore, we will show that this model
incorporates two important global learning approaches, the Biased Minimax
Probability Machine (BMPM) and the Minimax Probability Machine (MPM)
[29, 30].

Since all approaches within the paradigm of global learning require
summarizing the data completely and globally, they may waste computational
resources and are widely argued to be less direct. This motivates local
learning, which makes no attempt to model the data globally, but focuses on
extracting only the information directly related to the task. This type of
learning is often referred to as discriminative learning in the context of
classification. One famous model of this type is the Support Vector Machine
(SVM). With its task-oriented, robust, and computationally tractable
properties, SVM has achieved great success and is considered the current
state-of-the-art classifier. Although local learning demonstrates superior
performance to traditional global learning, it appears to sit at the
opposite extreme, totally discarding useful global information, e.g. the
structure information of data.
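
The non-parametric scheme mentioned above (a simple density fitted over
each observation, then combined) can be sketched as follows. This is a
Parzen-window style estimate under assumptions that are not from the text:
one-dimensional data, a Gaussian kernel, a hand-picked bandwidth, and an
average rather than a plain sum so that the result integrates to one. The
function name and the sample are hypothetical.

```python
import math

def parzen_density(x, observations, bandwidth):
    """Average of Gaussian kernels of width `bandwidth` centred on each observation."""
    norm = 1.0 / (bandwidth * math.sqrt(2 * math.pi))
    total = sum(norm * math.exp(-0.5 * ((x - z) / bandwidth) ** 2)
                for z in observations)
    return total / len(observations)

observations = [0.9, 1.1, 1.3, 4.8, 5.0, 5.3]   # made-up 1-D sample
near_cluster = parzen_density(1.0, observations, bandwidth=0.5)
between_clusters = parzen_density(3.0, observations, bandwidth=0.5)
print(near_cluster, between_clusters)
```

Every evaluation loops over all stored observations, which illustrates the
storage and training-set-size criticism raised above.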

Our suggestion is that these two different but complementary paradigms
should be combined. To this end, we propose a new model called the Maxi-Min
Margin Machine (M^4), which not only successfully employs the global
structure information of the data but also retains merits of local learning
such as robustness and superior classification accuracy. As a critical
contribution, M^4, the hybrid learning model, is a general model shown to
contain both local learning models and global learning models as special
cases. More specifically, it contains two significant and popular global
learning models, i.e. Fisher Discriminant Analysis (FDA) [13] and the
Minimax Probability Machine [28, 29, 30], as special cases. Meanwhile, SVM,
the local learning model, can also be considered one of its branches. In
addition, M^4 demonstrates a strong connection with MEMPM, the novel general
global learning model.

In the following, we first present the problem definition that will be used
throughout this book. We then use Fig. 2.1 to introduce and comment on each
type of learning model in turn. Finally, we summarize the review and
conclude by proposing the hybrid framework, the objective of this book.


2.1 Problem Definition 

Given a dataset $D$ consisting of $N$ observations, where each observation
is of the form $(z_1, z_2, \ldots, z_n, c)$ ($z_i \in \mathbb{R}$ for
$1 \le i \le n$, $c \in F$, where $F$ is a finite set), the basic learning
problem is to construct a mapping rule or a function $f$ from
$\{z_1, z_2, \ldots, z_n\}$, called features or attributes, to the output
$c$, denoted as the class variable, namely
$f(z_1, z_2, \ldots, z_n, \Theta, D) \to c$, where $\Theta$ denotes the
function parameters. The function $f$ should not only fit the observations
$D$ as accurately as possible, but also robustly predict the class of new
data. Sometimes, we also use $\Theta$ to denote the mapping model $f$ and
its associated parameters. For simplicity, we often use $\mathbf{z}$ to
denote the $n$-dimensional variable $\{z_1, z_2, \ldots, z_n\}$. When we
write $\mathbf{z}_j$, we refer to the $j$-th observation in $D$. Throughout
this book, unless stated otherwise, bold typeface indicates a vector or
matrix, while normal typeface refers to a scalar variable or a component of
a vector.


2.2 Global Learning 


Global learning often describes the data by attempting to estimate a
distribution over the variables $(z_1, z_2, \ldots, z_n, c)$, denoted as
$p(\mathbf{z}, c, \Theta \mid D)$. The estimated distribution can then be
used to make predictions by calculating the probability that a specific
value of $c$ will occur, given an instance of the features $\mathbf{z}$. In
more detail, the decision rule or mapping function can be described as:

$$c = \arg\max_{c_k \in F} p(c_k \mid D, \mathbf{z}) = \arg\max_{c_k \in F} \int p(c_k, \Theta \mid D, \mathbf{z}) \, d\Theta . \tag{2.1}$$

By employing Bayes' theorem, one can transform the above joint probability
(the term inside the integral) into the following equivalent form:

$$p(c_k, \Theta \mid D, \mathbf{z}) = \frac{p(c_k, \mathbf{z} \mid D, \Theta) \, p(\Theta \mid D)}{\sum_{c_k \in F} \int p(c_k, \mathbf{z} \mid D, \Theta) \, p(\Theta \mid D) \, d\Theta} . \tag{2.2}$$

Since the denominator above does not influence the decision in practice,
the decision rule of Eq.(2.1) can be written in a form that is relatively
easy to calculate:

$$c = \arg\max_{c_k \in F} \int p(c_k, \mathbf{z} \mid D, \Theta) \, p(\Theta \mid D) \, d\Theta . \tag{2.3}$$

Depending on how the model $\Theta$ is assumed on $D$, global learning can
be further divided into generative learning and non-parametric learning, as
elaborated in the following subsections.


2.2.1 Generative Learning 

Generative learning often assumes a specific model for the data $D$. For
example, a Gaussian distribution is assumed to be the underlying model that
generates $D$. In this case, the parameters $\Theta$ refer to the mean and
covariance of the Gaussian distribution. Many models belong to this type of
learning. Among them are the Naive Bayes model [9, 26, 32], the Gaussian
Mixture Model [4, 15, 16, 33], the Bayesian Network [19, 20, 21, 31, 40],
the Hidden Markov Model [2, 48], Logistic Regression [23], the Bayes Point
Machine [18, 36, 44], Maximum Entropy Estimation [22], etc. The key problem
for generative learning is how to learn the parameters $\Theta$ from data.
Generally, in the machine learning literature, three schemes, Maximum
Likelihood learning, Conditional Likelihood learning, and Bayesian Average
learning, are used to estimate the parameters. We describe these approaches
one by one in the following.
2.2.1.1 Maximum Likelihood Learning & Maximum A Posteriori Learning

Because the integral in Eq.(2.3) is not always easy to calculate, earlier
researchers often computed approximations of Eq.(2.3) instead. This
motivates Maximum Likelihood learning and Maximum A Posteriori (MAP)
learning [9, 40].

These learning methods replace Eq.(2.3) with the formulation below:

$$c = \arg\max_{c_k \in F} p(c_k, \mathbf{z} \mid D, \Theta^*) . \tag{2.4}$$

In the above, how $\Theta^*$ is estimated distinguishes MAP from ML.
In MAP, $\Theta^*$ is estimated as:

$$\Theta^* = \arg\max_{\Theta} p(\Theta \mid D) , \tag{2.5}$$

while in ML, the parameters are given as:

$$\Theta^* = \arg\max_{\Theta} p(D \mid \Theta) . \tag{2.6}$$


Observing Eq.(2.3), one can see that MAP in effect approximates the
conditional distribution over the parameters by a delta function located at
the most probable $\Theta$. Namely,

$$p(\Theta \mid D) = \begin{cases} 1, & \text{if } \Theta = \arg\max_{\Theta} p(\Theta \mid D) \\ 0, & \text{otherwise} \end{cases} \tag{2.7}$$

ML is even simpler. This can be seen from the relationship between MAP and
ML:

$$\arg\max_{\Theta} p(\Theta \mid D) = \arg\max_{\Theta} p(D \mid \Theta) \, p(\Theta) . \tag{2.8}$$

Thus, compared to MAP, ML omits the term $p(\Theta)$, the prior probability
over the parameters. In practice, a model with a more complex structure is
more likely to over-fit, meaning that the model fits the training data
perfectly while predicting poorly on test or future data. In this sense, by
discarding the prior probability, ML lacks the flexibility to favor simple
models through the prior [5, 49]. MAP, on the other hand, permits
regularization through the prior probability and thus has the potential to
resist over-fitting.

When applied in practice, under the condition of independent, identically
distributed (i.i.d.) data, rather than directly optimizing the original
form, ML estimation usually maximizes the log-likelihood, which transforms
the product into an easily solved sum:

$$\Theta^* = \arg\max_{\Theta} p(D \mid \Theta) = \arg\max_{\Theta} \log p(D \mid \Theta) = \arg\max_{\Theta} \sum_{j=1}^{N} \log p(\mathbf{z}_j \mid \Theta) . \tag{2.9}$$
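
The difference between the ML estimate of Eq.(2.6) and the MAP estimate of
Eq.(2.5) can be checked numerically on a small case. The sketch below is
only an illustration under assumptions that are not from the text: binary
observations with a single Bernoulli parameter, a Beta(2, 2) prior, made-up
data, and a plain grid search in place of a proper optimizer. The log form
of Eq.(2.9) is used for the likelihood.

```python
import math

def log_likelihood(theta, data):
    """Eq.(2.9): sum of per-observation log-probabilities under i.i.d. data."""
    return sum(math.log(theta if z == 1 else 1.0 - theta) for z in data)

def log_prior(theta, a=2.0, b=2.0):
    """Unnormalised log of a Beta(a, b) prior (an assumed choice of prior)."""
    return (a - 1.0) * math.log(theta) + (b - 1.0) * math.log(1.0 - theta)

data = [1, 1, 1, 0]                         # made-up binary observations
grid = [i / 1000 for i in range(1, 1000)]   # candidate parameter values

# ML maximises the likelihood alone; MAP adds the log prior, as in Eq.(2.8).
theta_ml = max(grid, key=lambda t: log_likelihood(t, data))
theta_map = max(grid, key=lambda t: log_likelihood(t, data) + log_prior(t))
print(theta_ml, theta_map)
```

With only four observations, the prior pulls the MAP estimate away from the
observed frequency toward 0.5, which is the regularizing effect described
above.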
2.2.1.2 Maximum Conditional Learning

Rather than computing the integral, both ML learning and MAP learning use
one specific point $\Theta^*$ to calculate Eq.(2.3). The difference between
them lies in how they estimate the specific parameter $\Theta^*$. Compared
with the long history of ML and MAP estimation, Maximum Conditional (MC)
learning has a short history but has achieved state-of-the-art performance
in many domains such as speech recognition [4, 42, 53].

Maximum Conditional learning also adopts a single $\Theta^*$ to simplify
the computation of Eq.(2.3). The difference is that $\Theta^*$ is selected
by maximizing a conditional likelihood, defined as follows:

$$\Theta^* = \arg\max_{\Theta} p(C \mid \Theta, Z) , \tag{2.10}$$

where $C = \{c^1, c^2, \ldots, c^N\}$ is the vector formed by the class
label of each observation in $D$, and
$Z = \{\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_N\}$ corresponds to
the data of the attributes (or features) in $D$. Similar to the relation
between ML and MAP, MC can also plug a prior probability into the above
formula to resist over-fitting, i.e.

$$\Theta^* = \arg\max_{\Theta} p(C \mid \Theta, Z) \, p(\Theta) . \tag{2.11}$$

By maximizing the conditional likelihood, MC is more direct and
classification-oriented. Note that only the conditional probability
maximized above is directly related to the classification purpose.
Maximizing other quantities, as done in ML or MAP, possibly optimizes
information that is unnecessary for classification, which is wasteful and
imprecise. However, although MC appears to be more precise, the
optimization is usually hard to carry out because of the conditional term.
An example can be seen in optimizing a tree-based Bayesian network [12].
Moreover, when information is missing, the optimization of MC may present
an even tougher problem in general, whereas in such circumstances powerful
Expectation Maximization (EM) techniques [27, 35] can easily be applied in
ML.

2.2.1.3 Bayesian Average Learning 

Note that in ML, MAP and MC, one particular $\theta^*$ is adopted as an
approximation for the easy calculation of Eq.(2.3). Although such a point
estimate enjoys computational advantages in approximating Eq.(2.3), in
practice it may be very inaccurate and in this sense may impair the prediction
ability of global learning. To solve this problem, recent research has
suggested using Bayesian Average learning approaches. This type of approach
facilitates the computation of Eq.(2.3) by changing the integral into a
summation based on sampling methods, e.g. Markov Chain Monte Carlo
methods [14, 25, 37, 38, 41].

Following this trend, many models have been proposed. Among them are the
Bayes Point Machine [18, 36, 44] and Maximum Entropy Estimation [22]. The
Bayes Point Machine restricts the averaging of the parameters to the version
space, which denotes the space where the training data can be perfectly
classified. This method is reported to have a better generalization ability
within the global learning framework. However, it lacks systematic ways to
extend its application to non-separable datasets, where the version space
may include no candidate solutions. Maximum Entropy Estimation, on the other
hand, seems to provide a more flexible and more systematic scheme for
averaging models. By maximizing an entropy-like objective, Maximum Entropy
Estimation demonstrates some characteristics of both global learning and
local learning. However, only two small datasets are used to evaluate its
performance. Moreover, the prior, usually unknown, plays an important role
in this model, but has to be assumed beforehand.


2.2.2 Non-parametric Learning 


In contrast with the generative learning discussed above, non-parametric
learning does not assume any specific global model before learning.
Therefore, no risk is taken on possibly wrong assumptions about the data.
Consequently, non-parametric learning appears to set a more valid foundation
than generative learning models. Typical non-parametric learning models in
the context of classification include Parzen Window estimation [10] and
the widely used k-Nearest-Neighbor model [7, 43]. We discuss these two
models in the following.

Parzen Window estimation also attempts to estimate a density from the
training data, but it does so in a totally different way. Parzen Window
first defines an $n$-dimensional hypercube cell region $R_N$ over each
observation. By defining a window function:

$$w(u) = \begin{cases} 1, & |u_j| \le 1/2, \; j = 1, 2, \ldots, n \\ 0, & \text{otherwise} \end{cases} \qquad (2.12)$$

the density is then estimated as:

$$p_N(z) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{h_N^n}\, w\!\left(\frac{z - z^i}{h_N}\right) , \qquad (2.13)$$

where $h_N$ is defined as the length of the edge of $R_N$.

From the above, one can observe that Parzen Window puts a local density
over each observation; the final density is then the result of averaging
all local densities. In practice, the window function can be a general
function, most commonly the Gaussian function. Fig. 2.2 illustrates a
density estimated by the Parzen Window algorithm.
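
The estimator of Eqs.(2.12) and (2.13) can be sketched in a few lines. The
function names, the sample points and the edge length h = 1 are assumptions
made for illustration only.

```python
def hypercube_window(u):
    """Window function of Eq.(2.12): 1 inside the unit hypercube centered at 0."""
    return 1.0 if all(abs(uj) <= 0.5 for uj in u) else 0.0

def parzen_density(z, samples, h):
    """Parzen Window estimate of Eq.(2.13) at point z, with cell edge length h."""
    n_dim = len(z)
    total = 0.0
    for zi in samples:
        u = [(a - b) / h for a, b in zip(z, zi)]
        total += hypercube_window(u) / (h ** n_dim)
    return total / len(samples)

# Hypothetical two-dimensional training samples.
samples = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (3.0, 3.0)]
density_near = parzen_density((0.1, 0.0), samples, h=1.0)
density_far = parzen_density((10.0, 10.0), samples, h=1.0)
print(density_near, density_far)
```

Two of the four samples fall inside the unit cell around (0.1, 0.0), so the
estimate there is 2/4 divided by the cell volume 1; far from all samples the
estimate is zero, which shows how strongly the density depends on the
training points.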

Fig. 2.2. An illustration of Parzen Window estimation

The k-Nearest-Neighbor method can be cast as designing a special cell
over each observation and then averaging all the cell densities as the
overall density for data. More specifically, the cell volume $V_N$ is
designed as follows: let the cell volume be a function of the training data,
by centering a cell around each point $z^j$ and increasing the volume until
$k_N$ samples are contained, where $k_N$ depends on $N$. The local density
for each observation is then defined as

$$p_N(z^j) = \frac{k_N / N}{V_N} . \qquad (2.14)$$

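When used for classification, the prediction is given by the class with the
maximum posterior probability, i.e.

$$c = \arg\max_{c_i} p_N(c_i|z) . \qquad (2.15)$$

Further, the posterior probability can be calculated as below:

$$p_N(c_i|z) = \frac{p_N(c_i, z)}{\sum_{l} p_N(z, c_l)} = \frac{(k_i/N)/V}{\sum_{l} (k_l/N)/V} = \frac{k_i}{k} , \qquad (2.16)$$

where the sums run over all classes, $k_i$ is the number of samples of class
$c_i$ in the cell, and $k$ is the total number of samples in the cell.
Therefore, the prediction is just the class with the maximum fraction
of the samples in a cell.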
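
The rule above (predict the class holding the largest fraction of the k
samples in the cell) can be sketched as follows. The Euclidean distance,
the function names and the toy data are assumptions for illustration.

```python
import math
from collections import Counter

def knn_posterior(z, samples, labels, k):
    """Return {class: k_i / k} over the k samples nearest to z."""
    order = sorted(range(len(samples)), key=lambda i: math.dist(z, samples[i]))
    counts = Counter(labels[i] for i in order[:k])
    return {c: counts[c] / k for c in counts}

def knn_predict(z, samples, labels, k):
    """Pick the class with the maximum posterior fraction k_i / k."""
    posterior = knn_posterior(z, samples, labels, k)
    return max(posterior, key=posterior.get)

# Hypothetical training set with two classes.
samples = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9)]
labels = ["x", "x", "x", "y", "y"]
posterior = knn_posterior((0.3, 0.3), samples, labels, k=4)
prediction = knn_predict((0.3, 0.3), samples, labels, k=4)
print(posterior, prediction)
```

Note that every training sample has to be kept and scanned at prediction
time, which is the storage cost discussed next.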

These non-parametric methods make no underlying assumptions about the data
and appear to be more general in real cases. However, using no parameters
actually means using many parameters, so that no single parameter dominates
the others (in the discussed models, the data points can in fact be
considered as the “parameters”). In this way, if one parameter fails to
work, it will not influence the whole system globally and statistically.
However, using many parameters also results in serious problems. One of
the main problems is that the density is overwhelmingly dependent on the
training samples. Therefore, to generate an accurate density, the number of
samples needs to be very large (much larger than would be required if we
performed the estimation by generative learning approaches). Even worse,
the number of samples required grows exponentially with the dimension of
the data. Another disadvantage is the severe storage requirement, since all
the samples need to be saved beforehand in order to predict new data.

2.2.3 The Minimum Error Minimax Probability Machine 

Within the context of global learning, a dilemma seems to exist: if we
assume a specific model as in generative learning, we lose generality; if we
instead use non-parametric learning, it is impractical for high-dimensional
data. A question then arises: can we have an approach that neither requires
a large number of training samples, thereby reducing complexity, nor assumes
a specific model, thereby maintaining generality? To this end, we propose
the Minimum Error Minimax Probability Machine (MEMPM) in this book.

Unlike generative learning or non-parametric learning, the Minimum Error
Minimax Probability Machine does not try to estimate a distribution over
data. Instead, it attempts to extract reliable global information from data
and estimates parameters that maximize the minimal probability that a future
data point will fall into the correct class. More precisely, rather than
seeking an accurate distribution, MEMPM focuses on the worst-case
probability (which is relatively robust) to predict data. In terms of the
style of decision making, MEMPM is more like a local learning method because
of its direct optimization for classification and its task-oriented
character. However, because MEMPM still summarizes global information from
data (though not a distribution), we locate it in the framework of global
learning.

The proposed MEMPM has many appealing features. First, it represents a
distribution-free Bayes optimal classifier in the worst-case scenario.
A balance is achieved by MEMPM in this way: no specific model is assumed on
data, since it is distribution-free; at the same time, although only in
the worst-case scenario, it is also a Bayes optimal classifier, which is
originally applicable only in cases with a known distribution. Another
critical feature of MEMPM is that, under a mild condition, it has an
explicit generalization bound. Furthermore, by exploring the bound, the
recently proposed Minimax Probability Machine is clearly shown to be its
special case. Importantly, by specifying a bound for one class of data, a
Biased Minimax Probability Machine (BMPM) is branched out from MEMPM, which
will be shown to provide a rigorous and systematic treatment for biased
classification. We will detail the MEMPM and BMPM models in the next
chapter.

2.3 Local Learning

Local learning adopts a largely different way to construct classifiers. This
type of learning is even more task-oriented than the Minimum Error Minimax
Probability Machine and Maximum Conditional learning. In the context of
classification, only the final mapping function from the features z to c is
crucial. Therefore, describing global information from data or explicitly
summarizing a distribution, whether conditional or joint, is a roundabout or
intermediate step and may be deemed wasteful or imprecise, especially when
the global information cannot be estimated accurately.

Alternatively, recent progress has suggested the local learning method,
better known as the discriminative learning method. This family of
approaches directly pinpoints the most critical quantities for
classification, while all other information less relevant to this purpose is
simply omitted. Compared to global learning, no model is assumed and no
explicit global information is engaged in this scheme. Among this school of
methods are Neural Networks [1, 11, 17, 34, 39, 43], Gabriel Graph methods
[3, 24, 54], and large margin classifiers [8, 45, 46, 47] including the
Support Vector Machine (SVM), a state-of-the-art classifier which achieves
superior performance in various pattern recognition fields. In the
following, we focus on introducing SVM in detail.


Support Vector Machines 


The Support Vector Machine is established based on minimizing the expected
classification risk defined as follows:

$$R(\theta) = \int_{z,c} l(z, c, \theta)\, d(p(z, c)) , \qquad (2.17)$$

where $l(z, c, \theta)$ is the loss function. A problem similar to that in
global learning occurs, since $p(z, c)$ is generally unknown. Therefore, in
practice, the above expected risk is often approximated by the so-called
empirical risk:

$$R_{emp}(\theta) = \frac{1}{N} \sum_{j=1}^{N} l(z^j, c^j, \theta) . \qquad (2.18)$$

The above loss function describes the extent to which the estimated class
disagrees with the real class for the training data. Various metrics can be
used for defining this loss function, including the 0-1 loss and the
quadratic loss [50].
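
The empirical risk of Eq.(2.18) is simply the average loss over the
training set. The sketch below evaluates it under the 0-1 loss and the
quadratic loss; the threshold classifier and the data are hypothetical and
only serve to make the example self-contained.

```python
def zero_one_loss(predicted, actual):
    """1 if the predicted class differs from the actual class, else 0."""
    return 0.0 if predicted == actual else 1.0

def quadratic_loss(predicted, actual):
    """Squared difference between predicted and actual labels."""
    return float((predicted - actual) ** 2)

def empirical_risk(classifier, samples, labels, loss):
    """Average loss over the training set, as in Eq.(2.18)."""
    total = sum(loss(classifier(z), c) for z, c in zip(samples, labels))
    return total / len(samples)

def threshold_classifier(z):
    """Hypothetical one-dimensional classifier with labels in {-1, +1}."""
    return 1 if z > 0.5 else -1

samples = [0.1, 0.4, 0.6, 0.9, 0.45]
labels = [-1, -1, 1, 1, 1]
risk_01 = empirical_risk(threshold_classifier, samples, labels, zero_one_loss)
risk_sq = empirical_risk(threshold_classifier, samples, labels, quadratic_loss)
print(risk_01, risk_sq)
```

Only the last sample is misclassified, so the 0-1 risk is 1/5 and the
quadratic risk is 4/5 for labels in {-1, +1}.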

However, considering only the training data may lead to the over-fitting
problem again. SVM takes one big step in dealing with over-fitting: the
margin between two classes should be made as large as possible in order to
reduce the over-fitting risk. Fig. 2.3 illustrates the idea of SVM.
Two classes of data, depicted as circles and solid dots, are presented in
this figure. Intuitively, there are many decision hyperplanes which can be
adopted for separating these two classes of data. However, the one plotted
in this figure is selected as the favorable separating plane, because it has
the maximum margin between the two classes. Therefore, a regularization term
representing the margin shows up in the objective function of SVM. Moreover,
as seen in this figure, only those filled points, called support vectors,
mainly determine the separating plane, while other points do not contribute
to the margin at all. In other words, only several local points are critical
for the classification purpose in the framework of SVM and thus should be
extracted.

A more formal explanation and theoretical foundation can be obtained from
the Structural Risk Minimization criterion [6, 52]. Therein, maximizing the
margin between different classes of data amounts to minimizing an upper
bound of the expected risk, i.e. the VC dimension bound [52]. However,
since the focus of this book does not lie in the theory of SVM, we will not
discuss the details further. Interested readers can refer to [51, 52].

2.4 Hybrid Learning 

Local learning (here simply regarded as SVM) has demonstrated its
advantages, such as its state-of-the-art performance (a lower generalization
error), its optimal and unique solution, and its mathematical tractability.
However, it does discard much useful information from data, e.g. the
structure information of the data.

An illustrative example has been seen in Fig. 1.4. In the current
state-of-the-art classifier, i.e. SVM, similar problems also occur. This can
be seen in Fig. 2.4. In this figure, the purpose is to separate two
categories of data x and y. As observed, the classification boundary is
intuitively determined mainly by the dotted axis, i.e. the long axis of the
y data (represented by □'s) or the short axis of the x data (represented by
o's). Moreover, along this axis, the y data are more likely to scatter than
the x data, since y has a relatively larger variance in this direction.
Noting this “global” fact, it seems reasonable for a good decision
hyperplane to lie closer to the x side (see the dash-dot line). However, SVM
ignores this kind of “global” information, i.e. the statistical trend of
data occurrence. The derived SVM decision hyperplane (the solid line) lies
unbiasedly right in the middle of two “local” points (the support vectors).
The above considerations directly motivate the Maxi-Min Margin Machine.



Fig. 2.4. A decision hyperplane with considerations of 
both local and global information 


2.5 Maxi-Min Margin Machine 

After examining the road-map of the learning models, especially global
learning and local learning, we have seen a strong motivation for combining
these two different but complementary schemes. More specifically, borrowing
from local learning the idea of assuming no distribution on data would set a
valid foundation for the learning models. Meanwhile, fusing robust global
information, e.g. structure information, into learning models appears to
further benefit the refinement of decisions in separating data.

Our effort is made in this direction. As will be detailed in Chapter 4,
the hybrid learning model, the Maxi-Min Margin Machine, successfully plugs
global information into the learning and enjoys good features from both
local learning and global learning. As seen in Fig. 2.1, the Maxi-Min Margin
Machine model has built up various connections with many models in
the literature: it incorporates the Support Vector Machine as a special
case, which lies in the framework of local learning; it also includes the
Minimax Probability Machine and Fisher Discriminant Analysis as direct
spin-offs. Moreover, a strong link has been established between this model
and the Minimum Error Minimax Probability Machine. Finally, empirical
investigations have shown that this combined model outperforms both local
learning models such as SVM and global learning models such as MPM.

In the next chapter, we will first present the Minimum Error Minimax 
Probability Machine which is a general global learning model. Following that, 
we then introduce the Maxi-Min Margin Machine and demonstrate its merits 
both theoretically and empirically. 

A General Global Learning Model: MEMPM


Traditional global learning, especially generative learning, enjoys a long
and distinguished history and holds many merits, e.g. a relatively simple
optimization and the flexibility to incorporate global information such as
structure information and invariance. However, it is widely argued that this
model lacks generality because it has to assume a specific model beforehand.
Assuming a specific model over data is useful in some cases. However, the
assumption may not always coincide with the true data distribution and thus
may be invalid in many circumstances. In this chapter, we propose a novel
global learning model, named the Minimum Error Minimax Probability Machine
(MEMPM), which is directly motivated by the Marshall and Olkin Probability
Theory [20, 24]. For classifying data correctly, this model focuses on
estimating the worst-case probability, which is not only more reliable but,
more importantly, removes the need to assume specific models. Furthermore,
this new model has several appealing features.

First, MEMPM actually presents a novel general framework for
classification. As demonstrated later, MEMPM includes a recently proposed
promising model, the Minimax Probability Machine, as its special case, which
is reported to achieve comparable performance to SVM. Interpretations from
the viewpoints of both the optimal thresholding problem and geometry will be
provided to show the advantages of MEMPM. Moreover, this novel model
branches out another promising special case, named the Biased Minimax
Probability Machine (BMPM) [12], and extends its application to an important
type of classification, i.e. biased classification.

Second, this model derives a distribution-free Bayes optimal classifier
in the worst-case scenario. It thus distinguishes itself from the
traditional global learning methods, or more particularly, the traditional
Bayes optimal classifiers, which have to assume a distribution on data and
thus lack generality in real cases. Furthermore, we will show that under
some conditions, e.g. when a Gaussian distribution is assumed on data, the
worst-case Bayes optimal classifier becomes the true Bayes optimal
hyperplane.

Third, the MEMPM model contains an explicit performance indicator,
namely an explicit upper bound on the probability of misclassification of
future data. Moreover, we will demonstrate theoretically and empirically
that MEMPM attains a smaller upper bound on the probability of
misclassification than MPM, which implies the advantage of MEMPM over MPM.

Fourth, although in general the optimization of MEMPM is shown to be a
non-concave problem, empirically it demonstrates good concavity in the main
“interest” region and thus can be solved practically. Furthermore, we will
show that the final optimization problem involves solving a one-dimensional
line search problem, which results in a satisfactory solving method.

This chapter is organized as follows. In the next section, we first
introduce the Marshall and Olkin Theory. We then present the main content
of this chapter, the MEMPM model, including its definition, interpretations,
the practical solving method, and the sufficient conditions for convergence
to the true Bayes decision hyperplane. Following that, we demonstrate a
robust version of MEMPM. In Section 3.4, we kernelize the MEMPM model to
attack nonlinear classification problems. We then, in Section 3.5, present a
series of experiments on synthetic datasets and real-world benchmark
datasets. In Section 3.6, we analyze the tightness of the worst-case
accuracy bound. In Section 3.7, we show that empirically MEMPM is often
concave in the main “interest” region. In Section 3.8, we present the
limitations of MEMPM and envision possible future work. Finally, we
summarize this chapter in Section 3.9.


3.1 Marshall and Olkin Theory 

The Marshall and Olkin Theory can be described as follows:

Theorem 3.1. [Marshall and Olkin Theory] The probability that a random
vector $y$ belongs to a convex set $S$ can be bounded by the following
formulation:

$$\sup_{y \sim (\bar{y}, \Sigma_y)} \Pr\{y \in S\} = \frac{1}{1 + d^2} , \quad \text{with} \quad d^2 = \inf_{y \in S} (y - \bar{y})^T \Sigma_y^{-1} (y - \bar{y}) , \qquad (3.1)$$

where the supremum is taken over all distributions for $y$ with mean
$\bar{y}$ and covariance matrix $\Sigma_y$.¹

The theory makes it possible to assume no model, yet bound the probability
of misclassifying a point and consequently develop a novel classifier within
the framework of global learning. More specifically, one can design a linear
separating plane by replacing $S$ with a half space associated with this
linear plane. Taking the supremum can then be considered as bounding the
misclassification rate for one class of data. In the following, we first
introduce the model definition and then show how this theory can be applied
therein to derive a distribution-free classifier.

¹ We assume $\Sigma_y$ to be positive definite for simplicity. Otherwise, we
can always add a small positive amount to its diagonal elements to make it
positive definite.


3.2 Minimum Error Minimax Probability Decision 
Hyperplane 

In this section, we first present the model definition of MEMPM while
reviewing the original MPM model. We then interpret MEMPM with respect to
MPM in Section 3.2.2. In Section 3.2.3, we specialize the MEMPM model
for dealing with biased classification. In Section 3.2.4, we analyze the
MEMPM optimization problem and propose a practical solving method. In
Section 3.2.5, we address the sufficient conditions under which the
worst-case Bayes optimal classifier derived from MEMPM becomes the true
Bayes optimal classifier. In Section 3.2.6, we provide a geometrical
interpretation for BMPM and MEMPM.

3.2.1 Problem Definition 

The notation in this chapter largely follows that of [16]. Let $x$ and $y$
denote two random vectors representing two classes of data with means and
covariance matrices $\{\bar{x}, \Sigma_x\}$ and $\{\bar{y}, \Sigma_y\}$,
respectively, in a two-category classification task, where
$x, y, \bar{x}, \bar{y} \in R^n$, and $\Sigma_x, \Sigma_y \in R^{n \times n}$.

Assuming that $\{\bar{x}, \Sigma_x\}$ and $\{\bar{y}, \Sigma_y\}$ for the two
classes of data are reliable, MEMPM attempts to determine the hyperplane
$w^T z = b$ ($w \in R^n \setminus \{0\}$, $z \in R^n$, $b \in R$, and
superscript $T$ denotes the transpose) which can separate the two classes of
data with the maximal probability. The formulation for the MEMPM model is
written as follows:

$$\max_{\alpha, \beta, w \neq 0, b} \; \{\theta\alpha + (1 - \theta)\beta\} , \qquad (3.2)$$

$$\text{s.t.} \quad \inf_{x \sim (\bar{x}, \Sigma_x)} \Pr\{w^T x \ge b\} \ge \alpha , \qquad (3.3)$$

$$\inf_{y \sim (\bar{y}, \Sigma_y)} \Pr\{w^T y \le b\} \ge \beta , \qquad (3.4)$$

where $\alpha$ and $\beta$ indicate the worst-case classification accuracies
of future data points for class $x$ and class $y$, respectively. Future
points $z$ for which $w^T z \ge b$ are then classified as class $x$;
otherwise they are judged as class $y$. $\theta \in [0, 1]$ is the prior
probability of class $x$ and $1 - \theta$ is thus the prior probability of
class $y$. Intuitively, maximizing $\theta\alpha + (1 - \theta)\beta$ can be
naturally considered as maximizing the expected worst-case accuracy for
future data. In other words, this optimization leads to minimizing the
expected upper bound of the error rate. More precisely, if we change
$\max\{\theta\alpha + (1 - \theta)\beta\}$ to
$\min\{\theta(1 - \alpha) + (1 - \theta)(1 - \beta)\}$ and consider
$1 - \alpha$ as the upper bound of the probability that an $x$ data point is
classified into class $y$ ($1 - \beta$ is considered similarly), the MEMPM
model exactly minimizes the maximum Bayes error and thus derives the
Bayes optimal hyperplane in the worst-case scenario. In comparison, MPM
assumes equal worst-case probabilities for both classes, i.e. it forces
$\alpha = \beta$. Obviously, this is inappropriate since there is no reason
to presume that the worst-case accuracies are equal. However, even in such a
constrained way, MPM is reported to achieve comparable performance to SVM, a
current state-of-the-art classifier. Therefore, the generalized case of MPM,
namely MEMPM, may be expected to be more promising. This will be empirically
demonstrated in the experimental part of this chapter.
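
To make the contrast between the two objectives concrete, the sketch below
works in one dimension, where the hyperplane is a threshold b and class x
lies on the side z >= b. It assumes the closed form d^2 / (1 + d^2) for the
worst-case accuracy, with d the distance from the class mean to b in units
of standard deviation; this is what the Marshall and Olkin bound gives for a
half line. The class statistics, the prior and the brute-force grid search
are illustrative assumptions, not the solving method of this chapter.

```python
def worst_case_accuracy(margin, std):
    """Distribution-free lower bound d^2 / (1 + d^2), d = max(margin, 0) / std."""
    d = max(margin, 0.0) / std
    return d * d / (1.0 + d * d)

def accuracies(b, mean_x, std_x, mean_y, std_y):
    """Worst-case accuracies (alpha, beta) of the rule: z >= b -> class x."""
    alpha = worst_case_accuracy(mean_x - b, std_x)
    beta = worst_case_accuracy(b - mean_y, std_y)
    return alpha, beta

# Hypothetical class statistics and prior.
mean_x, std_x = 3.0, 0.5
mean_y, std_y = 0.0, 1.5
theta = 0.5

def objective(b):
    """Expected worst-case accuracy theta * alpha + (1 - theta) * beta."""
    alpha, beta = accuracies(b, mean_x, std_x, mean_y, std_y)
    return theta * alpha + (1.0 - theta) * beta

# MEMPM-style choice: the threshold maximizing the objective on a fine grid.
grid = [mean_y + (mean_x - mean_y) * i / 3000 for i in range(1, 3000)]
b_mempm = max(grid, key=objective)

# MPM-style choice: alpha == beta, i.e. equal normalized distances to both means.
b_mpm = (std_y * mean_x + std_x * mean_y) / (std_x + std_y)

print(b_mempm, objective(b_mempm))
print(b_mpm, objective(b_mpm))
```

With these numbers the threshold that forces equal accuracies attains a
lower expected worst-case accuracy than the unconstrained maximizer, which
is the point of dropping the constraint that the two accuracies be equal.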

3.2.2 Interpretation 

We interpret MEMPM with respect to MPM in this section. First, it is evident
that if we presume $\alpha = \beta$, the optimization of MEMPM reduces to
the MPM optimization. This means that MPM is actually a special case of
MEMPM.

An analogy to illustrate the difference between MEMPM and MPM can
be seen in the optimal thresholding problem. Fig. 3.1 illustrates this
analogy. To separate two classes of one-dimensional data with density
functions $p_1$ and $p_2$, respectively, the optimal thresholding is given
by the decision plane in Fig. 3.1(a) (assuming that the prior probabilities
for the two classes of data are equal). This optimal thresholding
corresponds to the point minimizing the error rate
$(1 - \alpha) + (1 - \beta)$ or maximizing the accuracy $\alpha + \beta$,
which is exactly the intersection point of the two density functions
($1 - \alpha$ represents the area of the 135°-line filled region and
$1 - \beta$ represents the area of the 45°-line filled region).
On the other hand, the thresholding point that forces $\alpha = \beta$ is
not necessarily the optimal point to separate these two classes.

Fig. 3.1. An analogy to illustrate the difference between MEMPM
and MPM with equal prior probabilities for two classes. The optimal
decision plane corresponds to the intersection point, where the error
$(1 - \alpha) + (1 - \beta)$ is minimized (or the accuracy $\alpha + \beta$
is maximized) as implied by MEMPM, rather than the one where $\alpha$ is
equal to $\beta$ as implied by MPM
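
The thresholding analogy can be checked numerically when the densities are
known. The sketch below assumes two Gaussian densities with hypothetical
parameters (the analogy itself does not prescribe any particular densities)
and compares the threshold minimizing the total error with the one that
forces the two accuracies to be equal.

```python
import math

def normal_cdf(t, mean, std):
    """Cumulative distribution function of a 1-D Gaussian."""
    return 0.5 * (1.0 + math.erf((t - mean) / (std * math.sqrt(2.0))))

def normal_pdf(t, mean, std):
    """Density of a 1-D Gaussian."""
    u = (t - mean) / std
    return math.exp(-0.5 * u * u) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical densities: class 1 on the right, class 2 on the left.
mean1, std1 = 3.0, 0.5
mean2, std2 = 0.0, 1.0

def total_error(t):
    """(1 - alpha) + (1 - beta) for the rule: value >= t -> class 1."""
    miss1 = normal_cdf(t, mean1, std1)         # class-1 mass falling below t
    miss2 = 1.0 - normal_cdf(t, mean2, std2)   # class-2 mass falling above t
    return miss1 + miss2

grid = [mean2 + (mean1 - mean2) * i / 3000 for i in range(1, 3000)]
t_opt = min(grid, key=total_error)                             # minimum total error
t_equal = (std2 * mean1 + std1 * mean2) / (std1 + std2)        # forces alpha == beta

print(t_opt, total_error(t_opt))
print(t_equal, total_error(t_equal))
```

The minimizing threshold sits where the two densities take (nearly) the
same value, while the equal-accuracy threshold lies elsewhere and incurs a
larger total error.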

It should be clarified that the MEMPM model assumes no distributions.
This distinguishes the MEMPM model from the traditional Bayes optimal
thresholding method, which has to make specific assumptions about the data
distribution. On the other hand, although MEMPM minimizes the upper bound
of the Bayes error rate of future data points, as shown later in
Section 3.2.5, it represents the true Bayes optimal hyperplane under some
conditions, e.g. when a Gaussian distribution is assumed on data.²

3.2.3 Special Case for Biased Classifications 

The above discussion only covers unbiased classification, which does not
intentionally favor one class over the other. However, another important
type of pattern recognition task, namely biased classification, arises
very often in practice. In this scenario, one class is usually more
important than the other, so a bias should be imposed towards the important
class. A typical example can be seen in the diagnosis of an epidemic
disease. Classifying a patient who is infected with the disease into the
opposite class has serious consequences. Thus in this problem, the
classification accuracy should be biased towards the class with the disease.
In other words, we would prefer to diagnose a person without the disease as
infected rather than the other way round.

In the following, we demonstrate that MEMPM contains a special case, which
we call the Biased Minimax Probability Machine, for biased classification.
We formulate this special case as:

$$\max_{\alpha, \beta, w \neq 0, b} \; \alpha ,$$

$$\text{s.t.} \quad \inf_{x \sim (\bar{x}, \Sigma_x)} \Pr\{w^T x \ge b\} \ge \alpha ,$$

$$\inf_{y \sim (\bar{y}, \Sigma_y)} \Pr\{w^T y \le b\} \ge \beta_0 ,$$

where $\beta_0$ is a pre-specified positive constant, which represents an
acceptable accuracy level for the less important class $y$.

² Another interpretation of the difference between MEMPM and MPM can be
stated from the viewpoint of Game Theory. MPM can be regarded as a
non-cooperative competitive game. In this game, each player (class) tries to
maximize its individual benefit, i.e. $\alpha$. The competition leads to
each class obtaining the same benefit when all classes reach a kind of
equilibrium. However, in game theory, many models, e.g. the prisoners'
dilemma, the Cournot model and the tragedy of the commons [21], have shown
that maximizing individual benefit does not lead to the global optimum. Our
model, on the contrary, can be considered as a kind of cooperative game. It
achieves the global optimum through cooperation.

The above optimization uses a typical setting in biased classification,
i.e. the accuracy for the important class (associated with $x$) should be as
high as possible, provided that the accuracy for the less important class
(associated with $y$) is maintained at an acceptable level specified by the
lower bound $\beta_0$ (which can be set by users).

By quantitatively plugging a specified bias $\beta_0$ into classification
and also containing an explicit accuracy bound $\alpha$ for the important
class, BMPM provides a more direct and elegant way to perform biased
classification. In comparison, to achieve a specified bias, traditional
biased classifiers such as the Weighted Support Vector Machine [23] and the
Weighted k-Nearest Neighbor method [18] usually adopt different costs for
different classes. However, due to the difficulty of building quantitative
connections between the cost and the accuracy,³ these methods need to resort
to a trial-and-error procedure to find suitable costs for imposing a
specified bias; such procedures are generally indirect and lack rigorous
treatment.


3.2.4 Solving the MEMPM Optimization Problem 

In this section, we propose a method to solve the MEMPM optimization
problem. As will be demonstrated shortly, the MEMPM optimization can be
transformed into a one-dimensional line search problem. More specifically,
the objective function of the line search problem is implicitly determined
by solving a BMPM problem. Therefore, solving the line search problem
corresponds to solving a Sequential Biased Minimax Probability Machine
(SBMPM) problem. Before we proceed, we first introduce how to solve the
BMPM optimization problem.


3.2.4.1 Solving the BMPM Optimization Problem 

First, we describe Lemma 3.2 which is developed in [16].

Lemma 3.2. Given $w \neq 0$ and $b$, such that $w^T \bar{y} \le b$ and
$\beta \in [0, 1)$, the condition:

$$\inf_{y \sim (\bar{y}, \Sigma_y)} \Pr\{w^T y \le b\} \ge \beta$$

holds if and only if

$$b - w^T \bar{y} \ge \kappa(\beta) \sqrt{w^T \Sigma_y w} \quad \text{with} \quad \kappa(\beta) = \sqrt{\frac{\beta}{1 - \beta}} .$$

The lemma can be proved according to the Marshall and Olkin Theory
and the Lagrangian Multiplier theory.


³ Although cross validations could be used to provide empirical connections,
they are problem-dependent and are usually slow procedures as well.

Proof. In the Marshall and Olkin Theory, if we define
$S = \{w^T y \ge b\}$, the theorem becomes:

$$\sup_{y \sim (\bar{y}, \Sigma_y)} \Pr\{w^T y \ge b\} = \frac{1}{1 + d^2} , \quad \text{with} \quad d^2 = \inf_{w^T y \ge b} (y - \bar{y})^T \Sigma_y^{-1} (y - \bar{y}) .$$

We next show that $d$ can be obtained as follows:

$$d^2 = \inf_{w^T y \ge b} (y - \bar{y})^T \Sigma_y^{-1} (y - \bar{y}) = \frac{\max(b - w^T \bar{y}, 0)^2}{w^T \Sigma_y w} .$$

This can be proved by using the Lagrangian multiplier method as follows:

(1) If $w^T \bar{y} < b$.

Denoting $p^T = w^T \Sigma_y^{1/2}$, $g = \Sigma_y^{-1/2} (y - \bar{y})$, and
$q = b - w^T \bar{y}$, one can write $d^2 = \inf_{p^T g \ge q} g^T g$. One can
obtain $g$ by introducing a Lagrangian multiplier:

$$\{g, \lambda\} = \arg\min_{g} \arg\max_{\lambda} \{g^T g + \lambda (q - p^T g)\} ,$$

where the multiplier $\lambda \ge 0$. Therefore, one can get the following
equalities:

$$g = \frac{\lambda p}{2} , \quad q = p^T g .$$

Since $w^T \bar{y} < b$, one can easily obtain $q > 0$. One can further
obtain:

$$\lambda = \frac{2q}{p^T p} , \quad g = \frac{q\, p}{p^T p} .$$

Finally, this leads to the following equation:

$$d^2 = \inf_{w^T y \ge b} (y - \bar{y})^T \Sigma_y^{-1} (y - \bar{y}) = \frac{(b - w^T \bar{y})^2}{w^T \Sigma_y w} .$$

(2) If $w^T \bar{y} \ge b$.

In this case, we can simply take $y = \bar{y}$. Therefore, $d = 0$.

Substituting $d^2$ into the bound, the condition
$\inf \Pr\{w^T y \le b\} \ge \beta$ amounts to requiring
$1/(1 + d^2) \le 1 - \beta$, i.e. $d^2 \ge \beta/(1 - \beta)$, which gives
$b - w^T \bar{y} \ge \kappa(\beta) \sqrt{w^T \Sigma_y w}$.
By integrating the above, we thus complete the proof of this lemma.
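
As a quick numerical check of the closed form for d^2 and of Lemma 3.2, the
sketch below evaluates the worst-case accuracy for a given hyperplane and
verifies that the lemma's inequality holds exactly up to that accuracy. The
helper names and the numerical values of w, b, the mean and the covariance
are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def quad_form(w, cov):
    """Return w^T cov w for a covariance matrix given as a list of rows."""
    n = len(w)
    return sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n))

def kappa(p):
    """kappa(p) = sqrt(p / (1 - p)) as defined in Lemma 3.2."""
    return math.sqrt(p / (1.0 - p))

def worst_case_accuracy(w, b, mean, cov):
    """Lower bound on Pr{w^T y <= b}: d^2 / (1 + d^2), with d^2 from the proof."""
    gap = b - dot(w, mean)
    d2 = max(gap, 0.0) ** 2 / quad_form(w, cov)
    return d2 / (1.0 + d2)

def lemma_condition(w, b, mean, cov, beta):
    """Inequality of Lemma 3.2: b - w^T mean >= kappa(beta) * sqrt(w^T cov w)."""
    return b - dot(w, mean) >= kappa(beta) * math.sqrt(quad_form(w, cov))

# Hypothetical hyperplane and class statistics.
w, b = [1.0, 2.0], 4.0
mean_y = [0.0, 0.5]
cov_y = [[1.0, 0.2], [0.2, 0.5]]

beta_star = worst_case_accuracy(w, b, mean_y, cov_y)
print(beta_star)
print(lemma_condition(w, b, mean_y, cov_y, beta_star - 0.01))  # slightly below the bound
print(lemma_condition(w, b, mean_y, cov_y, beta_star + 0.01))  # slightly above the bound
```

The inequality is satisfied for any accuracy level below the computed
worst-case accuracy and fails just above it, in line with the "if and only
if" statement of the lemma.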

By using Lemma 3.2 we can transform the BMPM optimization problem
as follows:

$$\max_{\alpha, w \neq 0, b} \; \alpha , \qquad (3.5)$$

$$\text{s.t.} \quad -b + w^T \bar{x} \ge \kappa(\alpha) \sqrt{w^T \Sigma_x w} , \qquad (3.6)$$

$$b - w^T \bar{y} \ge \kappa(\beta_0) \sqrt{w^T \Sigma_y w} , \qquad (3.7)$$

where $\kappa(\alpha) = \sqrt{\frac{\alpha}{1 - \alpha}}$ and
$\kappa(\beta_0) = \sqrt{\frac{\beta_0}{1 - \beta_0}}$. Eq.(3.7) is directly
obtained from Eq.(3.4) by using Lemma 3.2. Similarly, by changing
$w^T x \ge b$ to $w^T (-x) \le -b$, Eq.(3.6) can be obtained from Eq.(3.3).



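From Eqs.(3.6) and (3.7), we get

$$w^T\bar{y} + \kappa(\beta_0)\sqrt{w^T\Sigma_y w} \le b \le w^T\bar{x} - \kappa(\alpha)\sqrt{w^T\Sigma_x w} . \qquad (3.8)$$

If we eliminate $b$ from this inequality, we obtain:

$$w^T(\bar{x} - \bar{y}) \ge \kappa(\alpha)\sqrt{w^T\Sigma_x w} + \kappa(\beta_0)\sqrt{w^T\Sigma_y w} . \qquad (3.9)$$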
We observe that the magnitude of $w$ does not influence the solution of Eq.(3.9). Moreover, we can assume $\bar{x} \neq \bar{y}$; otherwise, if $\bar{x} = \bar{y}$, the minimax machine does not have a physical meaning. In this case, Eq.(3.9) may even have no solution for every $\beta_0 \neq 0$, since the right-hand side would always be positive provided that $w \neq 0$. Thus in this extreme case, $\beta$ and $\alpha$ have to be zero, which means that the worst-case accuracies are always zero.

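Without loss of generality, we can set $w^T(\bar{x} - \bar{y}) = 1$. Thus the problem can be further changed as:

$$\max_{\alpha,\,w\neq 0}\ \alpha , \qquad (3.10)$$

$$\text{s.t.}\quad 1 \ge \kappa(\alpha)\sqrt{w^T\Sigma_x w} + \kappa(\beta_0)\sqrt{w^T\Sigma_y w} , \qquad (3.11)$$

$$\phantom{\text{s.t.}\quad} w^T(\bar{x} - \bar{y}) = 1 . \qquad (3.12)$$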


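Since $\Sigma_x$ can be assumed to be positive definite (otherwise, we can always add a small positive amount to its diagonal elements to make it positive definite), from Eq.(3.11) we can obtain:

$$\kappa(\alpha) \le \frac{1 - \kappa(\beta_0)\sqrt{w^T\Sigma_y w}}{\sqrt{w^T\Sigma_x w}} . \qquad (3.13)$$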


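Because $\kappa(\alpha)$ increases monotonically with $\alpha$, maximizing $\alpha$ is equivalent to maximizing $\kappa(\alpha)$, which further leads to:

$$\max_{w\neq 0}\ \frac{1 - \kappa(\beta_0)\sqrt{w^T\Sigma_y w}}{\sqrt{w^T\Sigma_x w}} \qquad \text{s.t.}\quad w^T(\bar{x} - \bar{y}) = 1 .$$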


This kind of optimization is called a Fractional Programming (FP) problem [13, 19, 26]. To elaborate further, this optimization is equivalent to solving the following fractional problem:

$$\max_{w\neq 0}\ \frac{f(w)}{g(w)} , \qquad (3.14)$$

subject to $w \in A = \{w \mid w^T(\bar{x} - \bar{y}) = 1\}$, where $f(w) = 1 - \kappa(\beta_0)\sqrt{w^T\Sigma_y w}$ and $g(w) = \sqrt{w^T\Sigma_x w}$.
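To make the fractional objective concrete, the sketch below evaluates $f(w)/g(w)$ for one feasible $w$ and converts the resulting $\kappa(\alpha)$ back into a worst-case accuracy via $\alpha = \kappa^2/(1+\kappa^2)$. The means and covariances are those of the synthetic example in Section 3.5.1; the choice of `w`, the value `beta0 = 0.5`, and all function names are assumptions made for illustration.

```python
import math

def kappa(p):
    """kappa(p) = sqrt(p / (1 - p))."""
    return math.sqrt(p / (1.0 - p))

def kappa_inv(k):
    """Invert kappa: the accuracy alpha with kappa(alpha) = k is k^2 / (1 + k^2)."""
    return k * k / (1.0 + k * k)

def quad_sqrt(w, cov):
    """sqrt(w^T cov w)."""
    n = len(w)
    return math.sqrt(sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n)))

def bmpm_objective(w, cov_x, cov_y, beta0):
    """f(w)/g(w) from Eq.(3.14); w is assumed to satisfy w^T (mean_x - mean_y) = 1."""
    f = 1.0 - kappa(beta0) * quad_sqrt(w, cov_y)
    g = quad_sqrt(w, cov_x)
    return f / g

mean_x, cov_x = [3.0, 0.0], [[4.0, 0.0], [0.0, 1.0]]
mean_y, cov_y = [-1.0, 0.0], [[1.0, 0.0], [0.0, 5.0]]
w = [0.25, 0.0]  # feasible: w^T (mean_x - mean_y) = 0.25 * 4 = 1
k_alpha = bmpm_objective(w, cov_x, cov_y, beta0=0.5)
alpha = kappa_inv(k_alpha)
```

For this particular $w$ the objective equals 1.5, i.e. a worst-case accuracy of $\alpha = 9/13$ for class x when class y is held at $\beta_0 = 0.5$.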

Theorem 3.3. The Fractional Programming problem Eq.(3.14) associated with the BMPM optimization is a pseudo-concave problem whose every local optimum is the global optimum.

Proof. It is easy to see that the domain $A$ is a convex set in $\mathbb{R}^n$, and that $f(w)$ and $g(w)$ are differentiable on $A$. Moreover, since $\Sigma_x$ and $\Sigma_y$ can both be considered as positive definite matrices, $f(w)$ is a concave function on $A$ and $g(w)$ is a convex function on $A$. Then maximizing $f(w)/g(w)$ is a concave-convex FP problem. Hence it is a pseudo-concave problem [26]. Therefore, every local maximum is the global maximum [26].

To handle this specific FP problem, many methods such as the parametric method [26], the dual FP method [7, 25], and the concave FP method [6] can be used. A typical Conjugate Gradient method [2] for solving this problem has a worst-case $O(n^3)$ time complexity. Adding the time cost to estimate $\bar{x}$, $\bar{y}$, $\Sigma_x$, and $\Sigma_y$, the total cost for this method is $O(n^3 + Nn^2)$, where $N$ is the number of data points. This complexity is of the same order as that of the linear Support Vector Machines [27] and the linear MPM [16].

In this chapter, the Rosen gradient projection method [2] is used to find the solution of this pseudo-concave FP problem, which is proved to converge to a local maximum with a worst-case linear convergence rate. Moreover, the local maximum will exactly be the global maximum in this problem.

3.2.4.2 Sequential BMPM Optimization Method for MEMPM 

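We now turn to solving the MEMPM problem. Similar to Section 3.2.4.1, we can use Lemma 3.2 to transform the MEMPM optimization as follows:

$$\max_{\alpha,\,\beta,\,w\neq 0,\,b}\ \{\theta\alpha + (1-\theta)\beta\} , \qquad (3.15)$$

$$\text{s.t.}\quad -b + w^T\bar{x} \ge \kappa(\alpha)\sqrt{w^T\Sigma_x w} , \qquad (3.16)$$

$$\phantom{\text{s.t.}\quad} b - w^T\bar{y} \ge \kappa(\beta)\sqrt{w^T\Sigma_y w} . \qquad (3.17)$$

Using an analysis similar to that in Section 3.2.4.1, we can further transform the above optimization into

$$\max_{\alpha,\,\beta,\,w\neq 0}\ \{\theta\alpha + (1-\theta)\beta\} , \qquad (3.18)$$

$$\text{s.t.}\quad 1 \ge \kappa(\alpha)\sqrt{w^T\Sigma_x w} + \kappa(\beta)\sqrt{w^T\Sigma_y w} , \qquad (3.19)$$

$$\phantom{\text{s.t.}\quad} w^T(\bar{x} - \bar{y}) = 1 . \qquad (3.20)$$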

In the following we provide a lemma to show that the MEMPM solution is actually attained on the boundary of the set formed by the constraints of Eqs.(3.19) and (3.20).

Lemma 3.4. The maximum value of $\theta\alpha + (1-\theta)\beta$ under the constraints of Eqs.(3.19) and (3.20) is achieved when the right-hand side of Eq.(3.19) is strictly equal to 1.

Proof. Assume the maximum is achieved when

$$1 > \kappa(\beta)\sqrt{w^T\Sigma_y w} + \kappa(\alpha)\sqrt{w^T\Sigma_x w} .$$

A new solution constructed by increasing $\alpha$ or $\kappa(\alpha)$ by a small positive amount,$^4$ while keeping $\beta$ and $w$ unchanged, will satisfy the constraints and will be a better solution.


By applying Lemma 3.4 we can transform the optimization problem 
Eq.(3.18) under the constraints of Eqs.(3.19) and (3.20) as follows: 


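$$\max_{\beta,\,w\neq 0}\ \theta\,\frac{\kappa^2(\alpha)}{\kappa^2(\alpha)+1} + (1-\theta)\beta , \qquad (3.21)$$

$$\text{s.t.}\quad w^T(\bar{x} - \bar{y}) = 1 , \qquad (3.22)$$

where

$$\kappa(\alpha) = \frac{1 - \kappa(\beta)\sqrt{w^T\Sigma_y w}}{\sqrt{w^T\Sigma_x w}} .$$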


In Eq.(3.21), if we fix $\beta$ to a specific value within $[0,1)$, the optimization is equivalent to maximizing $\kappa^2(\alpha)/(\kappa^2(\alpha)+1)$ and further equivalent to maximizing $\kappa(\alpha)$, which is exactly the BMPM problem. We can then update $\beta$ according to some rules and repeat the whole process until an optimal $\beta$ is found. This is also the so-called line search problem [2, 1]. More precisely, if we denote the value of the optimization as a function $f(\beta)$, the above procedure corresponds to finding an optimal $\beta$ to maximize $f(\beta)$. Instead of using an explicit function as in traditional line search problems, the value of the function here is implicitly given by a BMPM optimization procedure.

Many methods can be used to solve the line search problem. In this chapter, we use the Quadratic Interpolation (QI) method [2]. As illustrated in Fig.3.2, QI finds the maximum point by updating a three-point pattern $(\beta_1, \beta_2, \beta_3)$ repeatedly. The new $\beta$, denoted by $\beta_{new}$, is given by the quadratic interpolation from the three-point pattern. Then a new three-point pattern is constructed from $\beta_{new}$ and two of $\beta_1, \beta_2, \beta_3$. This method can be shown to converge superlinearly to a local optimum point [2]. Moreover, as shown in Section 3.7, although MEMPM generally cannot guarantee its concavity, empirically it is often a concave problem. Thus the local optimum will often be the global optimum in practice.
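The sequential procedure can be sketched for the simplest possible setting. We assume here, purely for illustration, one-dimensional data, so that the inner BMPM step has the closed form $\kappa(\alpha) = (|\bar{x}-\bar{y}| - \kappa(\beta)\sigma_y)/\sigma_x$; in higher dimensions this step would be a full BMPM optimization. The function names, the toy parameters, and the initial three-point pattern are hypothetical.

```python
import math

def kappa(p):
    """kappa(p) = sqrt(p / (1 - p))."""
    return math.sqrt(p / (1.0 - p))

def mempm_value(beta, theta, dist, sd_x, sd_y):
    """f(beta) = theta * alpha + (1 - theta) * beta, with alpha from the 1-D BMPM step."""
    k_alpha = max((dist - kappa(beta) * sd_y) / sd_x, 0.0)
    alpha = k_alpha ** 2 / (1.0 + k_alpha ** 2)
    return theta * alpha + (1.0 - theta) * beta

def qi_line_search(f, b1, b2, b3, tol=1e-8, max_iter=100):
    """Quadratic interpolation on a pattern b1 < b2 < b3 with f(b2) >= f(b1), f(b3)."""
    f1, f2, f3 = f(b1), f(b2), f(b3)
    for _ in range(max_iter):
        num = (b2 - b1) ** 2 * (f2 - f3) - (b2 - b3) ** 2 * (f2 - f1)
        den = (b2 - b1) * (f2 - f3) - (b2 - b3) * (f2 - f1)
        if den == 0.0:
            break
        b_new = b2 - 0.5 * num / den  # vertex of the interpolating parabola
        if not (b1 < b_new < b3) or abs(b_new - b2) < tol:
            break
        f_new = f(b_new)
        # Keep the three points that still bracket the maximum.
        if b_new > b2:
            if f_new >= f2:
                b1, f1, b2, f2 = b2, f2, b_new, f_new
            else:
                b3, f3 = b_new, f_new
        else:
            if f_new >= f2:
                b3, f3, b2, f2 = b2, f2, b_new, f_new
            else:
                b1, f1 = b_new, f_new
    return b2, f2

# Toy 1-D problem (assumed): means 4 apart, sd_x = 2, sd_y = 1, equal priors.
theta, dist, sd_x, sd_y = 0.5, 4.0, 2.0, 1.0
f = lambda beta: mempm_value(beta, theta, dist, sd_x, sd_y)
beta_star, value_star = qi_line_search(f, 0.6, 0.75, 0.85)

# Balanced (MPM-style) bound for comparison: kappa = dist / (sd_x + sd_y).
k_mpm = dist / (sd_x + sd_y)
mpm_bound = k_mpm ** 2 / (1.0 + k_mpm ** 2)
```

In this toy problem the line search settles at an unbalanced pair of worst-case accuracies, and the resulting value of $\theta\alpha + (1-\theta)\beta$ exceeds the balanced bound of 0.64 obtained when both classes are forced to share the same $\kappa$.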

4 Since $\kappa(\alpha)$ increases monotonically with $\alpha$, increasing $\alpha$ by a small positive amount corresponds to increasing $\kappa(\alpha)$ by a small positive amount.

Fig. 3.2. A three-point pattern and quadratic line search method. A $\beta_{new}$ is obtained and a new three-point pattern is constructed by $\beta_{new}$ and two of $\beta_1$, $\beta_2$ and $\beta_3$


So far, we have not mentioned how to calculate the intercept $b$. From Lemma 3.4, we can see that the inequalities Eqs.(3.16) and (3.17) will become equalities at the maximum point $(w_*, b_*)$. The optimal $b$ will thus be obtained by

$$b_* = w_*^T\bar{x} - \kappa(\alpha_*)\sqrt{w_*^T\Sigma_x w_*} = w_*^T\bar{y} + \kappa(\beta_*)\sqrt{w_*^T\Sigma_y w_*} . \qquad (3.23)$$

3.2.5 When the Worst-case Bayes Optimal Hyperplane Becomes 
the True One 

As discussed, the MEMPM derives the worst-case Bayes optimal hyperplane. It is therefore interesting to ask under what conditions the worst-case optimal hyperplane becomes the true optimal one.

In the following we demonstrate two propositions: the first is that when data are assumed to follow certain distributions, e.g. the Gaussian distribution, the MEMPM leads to the Bayes optimal classifier; the second is that when applied to high-dimensional classification tasks, the MEMPM can be adapted to converge to the true Bayes optimal classifier under the Lyapunov condition.

To introduce the first proposition, we begin by assuming that the data follow a Gaussian distribution.

Assuming $x \sim N(\bar{x}, \Sigma_x)$ and $y \sim N(\bar{y}, \Sigma_y)$, Eq.(3.3) becomes:

$$\begin{aligned}
\inf_{x\sim N(\bar{x},\Sigma_x)} \Pr\{w^Tx \ge b\} &= \Pr_{x\sim N(\bar{x},\Sigma_x)}\{w^Tx \ge b\} \\
&= \Pr\left\{N(0,1) \ge \frac{b - w^T\bar{x}}{\sqrt{w^T\Sigma_x w}}\right\} \\
&= 1 - \Phi\left(\frac{b - w^T\bar{x}}{\sqrt{w^T\Sigma_x w}}\right) \\
&= \Phi\left(\frac{-b + w^T\bar{x}}{\sqrt{w^T\Sigma_x w}}\right) \ge \alpha ,
\end{aligned} \qquad (3.24)$$


where $\Phi(z)$ is the cumulative distribution function of the standard normal distribution, defined as:

$$\Phi(z) = \Pr\{N(0,1) \le z\} = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{z} \exp\left(-\frac{s^2}{2}\right) ds .$$

Due to the monotonic property of $\Phi(z)$, we can further write Eq.(3.24) as:

$$-b + w^T\bar{x} \ge \Phi^{-1}(\alpha)\sqrt{w^T\Sigma_x w} .$$

Constraint Eq.(3.4) can be reformulated in a similar form. The optimization Eq.(3.2) is thus changed to:

$$\max_{\alpha,\,\beta,\,w\neq 0,\,b}\ \{\theta\alpha + (1-\theta)\beta\} , \qquad (3.25)$$

$$\text{s.t.}\quad -b + w^T\bar{x} \ge \Phi^{-1}(\alpha)\sqrt{w^T\Sigma_x w} , \qquad b - w^T\bar{y} \ge \Phi^{-1}(\beta)\sqrt{w^T\Sigma_y w} . \qquad (3.26)$$

The above optimization is nearly the same as Eq.(3.2) subject to the constraints of Eqs.(3.3) and (3.4), except that $\kappa(\alpha)$ is equal to $\Phi^{-1}(\alpha)$ instead of $\sqrt{\alpha/(1-\alpha)}$. Thus, it can be similarly solved based on the Sequential Biased Minimax Probability Machine method.
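The practical difference between the two choices of $\kappa$ can be seen by tabulating them. The sketch below is only illustrative; the function names are our own, and `NormalDist().inv_cdf` from the Python standard library is used for $\Phi^{-1}$.

```python
import math
from statistics import NormalDist

def kappa_worst_case(alpha):
    """Distribution-free factor: sqrt(alpha / (1 - alpha))."""
    return math.sqrt(alpha / (1.0 - alpha))

def kappa_gaussian(alpha):
    """Gaussian factor: the inverse standard normal CDF at alpha."""
    return NormalDist().inv_cdf(alpha)

levels = (0.6, 0.8, 0.9, 0.99)
table = [(a, kappa_worst_case(a), kappa_gaussian(a)) for a in levels]
for a, k_wc, k_g in table:
    print(f"alpha={a:.2f}  worst-case kappa={k_wc:.3f}  Gaussian kappa={k_g:.3f}")
```

For every level shown, the Gaussian factor is smaller than the distribution-free one, so the same margin certifies a higher accuracy once the Gaussian assumption is accepted.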

On the other hand, the Bayes optimal hyperplane corresponds to the one, $w^Tz = b$, which minimizes the Bayes error:

$$\min_{w\neq 0,\,b}\ \theta\Pr_{x\sim N(\bar{x},\Sigma_x)}\{w^Tx \le b\} + (1-\theta)\Pr_{y\sim N(\bar{y},\Sigma_y)}\{w^Ty \ge b\} . \qquad (3.27)$$

Minimizing this error is equivalent to maximizing the accuracy $\theta\Pr\{w^Tx \ge b\} + (1-\theta)\Pr\{w^Ty \le b\}$, which is exactly the upper bound of $\theta\alpha + (1-\theta)\beta$. From Lemma 3.4 we know that the inequalities in Eq.(3.26) will eventually become equalities. Traced back to Eq.(3.24), the equalities imply that $\alpha$ and $\beta$ will achieve their upper bounds respectively. Therefore, with the Gaussian distribution assumption on data, the MEMPM derives the optimal Bayes hyperplane.

We propose Proposition 3.5 to extend the above analysis to general distribution assumptions.















Proposition 3.5. If the distribution of the normalized random variable

$$\frac{w^Tx - w^T\bar{x}}{\sqrt{w^T\Sigma_x w}} ,$$

denoted as $NS$, is independent of $w$, as is the case for the Gaussian distribution, an MEMPM version similar to the one under the Gaussian distribution assumption is easily derived, except that $\Phi(z)$ is changed to $\Pr\{NS(0,1) \le z\}$. In such a case, minimizing the Bayes error bound will exactly minimize the true Bayes error.

Before presenting Proposition 3.7, we first introduce the Central Limit Theorem under the Lyapunov condition [5].

Theorem 3.6. Let $x_n$ be a sequence of independent random variables defined on the same probability space. Assume that $x_n$ has finite expected value $\mu_n$ and finite standard deviation $\sigma_n$. We define $s_n^2 = \sum_{i=1}^{n}\sigma_i^2$. Assume that the third central moment $r_n^3 = \sum_{i=1}^{n} E(|x_i - \mu_i|^3)$ is finite for every $n$, and that $\lim_{n\to\infty}(r_n/s_n) = 0$ (this is the Lyapunov condition). Then the sum $S_n = x_1 + \ldots + x_n$ converges towards a Gaussian distribution.

One interesting finding directly elicited from the above Central Limit Theorem is that, if the component variables $x_i$ of a given $n$-dimensional random variable $x$ satisfy the Lyapunov condition, the sum of weighted component variables $x_i$, $1 \le i \le n$, namely $w^Tx$, tends to a Gaussian distribution as $n$ grows.$^5$ This shows that, under the Lyapunov condition, when the dimension $n$ grows, the hyperplane derived by MEMPM with the Gaussian assumption tends to be the true Bayes optimal hyperplane. In this case, the MEMPM using $\Phi^{-1}(\alpha)$, the inverse function of the normal cumulative distribution, instead of $\sqrt{\alpha/(1-\alpha)}$, will converge to the true Bayes optimal decision hyperplane in the high-dimensional space. We summarize the analysis in Proposition 3.7.

Proposition 3.7. If the component variables $x_i$ of a given $n$-dimensional random variable $x$ satisfy the Lyapunov condition, the MEMPM hyperplane derived by using $\Phi^{-1}(\alpha)$, the inverse function of the normal cumulative distribution, will converge to the true Bayes optimal one.
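As a small worked check of the Lyapunov condition, the ratio $r_n/s_n$ can be computed in closed form for a weighted sum $w^Tx$ whose components are independent and uniform on $[-1, 1]$. This distribution is an assumption chosen only because its moments are simple: each term $w_i x_i$ has variance $w_i^2/3$ and third absolute central moment $|w_i|^3/4$. The function name is hypothetical.

```python
import math

def lyapunov_ratio(weights):
    """r_n / s_n for w^T x with independent components uniform on [-1, 1].

    s_n^2 sums the variances w_i^2 / 3; r_n^3 sums the third absolute
    central moments |w_i|^3 / 4.
    """
    s_n = math.sqrt(sum(w * w / 3.0 for w in weights))
    r_n = sum(abs(w) ** 3 / 4.0 for w in weights) ** (1.0 / 3.0)
    return r_n / s_n

# Equal weights: the ratio shrinks like n^(-1/6) as the dimension grows.
dims = (10, 100, 1000, 10000)
ratios = [lyapunov_ratio([1.0] * n) for n in dims]
```

With equal weights the ratio decays slowly but steadily with the dimension, which is the behaviour the condition requires; strongly unequal weights, where one component dominates, would slow or prevent this decay.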

The underlying justifications of the above two propositions are rooted in the fact that the generalized MPM is exclusively determined by the first and second moments. These two propositions actually emphasize the dominance of the first and second moments in representing data. More specifically, Proposition 3.5 hints that the distribution is decided only by moments up to the second. The Lyapunov condition in Proposition 3.7 also implies that the second order moment dominates the third order moment in the long run. It also deserves attention that, with fixed mean and covariance, the distribution given by Maximum Entropy Estimation is the Gaussian distribution [14]. This once again suggests the usage of $\Phi^{-1}(\alpha)$ in the high-dimensional space.

5 Some techniques such as Independent Component Analysis [8] can be applied to decorrelate the dependence among random variables beforehand.

3.2.6 Geometrical Interpretation 

In this section, we first provide a parametric solving method for BMPM, then 
demonstrate that this parametric method actually enables a nice geometrical 
interpretation for both BMPM and MEMPM. 

3.2.6.1 A Parametric Method for BMPM 

According to the parametric method, the fractional function can be iteratively optimized in two steps [26]:

Step 1. Find $w$ by maximizing $f(w) - \lambda g(w)$ in the domain $A$, where $\lambda \in \mathbb{R}$ is the newly introduced parameter.

Step 2. Update $\lambda$ by $f(w)/g(w)$.

The iteration of the above two steps is guaranteed to converge to the local maximum, which is also the global maximum in our problem. In the following, we adopt a method to solve the maximization problem in Step 1. Replacing $f(w)$ and $g(w)$, we expand the optimization problem as:

$$\max_{w\neq 0}\ \left\{1 - \kappa(\beta_0)\sqrt{w^T\Sigma_y w} - \lambda\sqrt{w^T\Sigma_x w}\right\} \quad \text{s.t.}\quad w^T(\bar{x} - \bar{y}) = 1 . \qquad (3.28)$$

Maximizing Eq.(3.28) is equivalent to $\min_w\ \kappa(\beta_0)\sqrt{w^T\Sigma_y w} + \lambda\sqrt{w^T\Sigma_x w}$ under the same constraint. By writing $w = w_0 + Fu$, where $w_0 = (\bar{x} - \bar{y})/\|\bar{x} - \bar{y}\|_2^2$ and $F \in \mathbb{R}^{n\times(n-1)}$ is an orthogonal matrix whose columns span the subspace of vectors orthogonal to $\bar{x} - \bar{y}$, an equivalent form (a factor 1/2 over each term has been dropped) to remove the constraint can be obtained:

$$\min_{u,\,\eta>0,\,\xi>0}\ \left\{\eta + \frac{\lambda^2}{\eta}\left\|\Sigma_x^{1/2}(w_0 + Fu)\right\|_2^2 + \xi + \frac{\kappa^2(\beta_0)}{\xi}\left\|\Sigma_y^{1/2}(w_0 + Fu)\right\|_2^2\right\} , \qquad (3.29)$$

where $\eta, \xi \in \mathbb{R}$. This optimization form is very similar to the one in the Minimax Probability Machine [15] and can also be solved by using an iterative least-squares approach.
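The two-step parametric iteration can be sketched for two-dimensional data, where the feasible set $w = w_0 + Fu$ is a line and Step 1 reduces to a one-dimensional concave maximization. In this sketch a golden-section search stands in for the iterative least-squares solver of Eq.(3.29); this substitution, the search interval `t_max`, the covariance matrices, and all function names are assumptions made for illustration.

```python
import math

def quad_sqrt(w, cov):
    """sqrt(w^T cov w) for a 2-vector w and a 2x2 matrix cov."""
    return math.sqrt(sum(w[i] * cov[i][j] * w[j] for i in range(2) for j in range(2)))

def ratio(w, cov_x, cov_y, k_beta):
    """The fractional objective f(w)/g(w) of Eq.(3.14)."""
    return (1.0 - k_beta * quad_sqrt(w, cov_y)) / quad_sqrt(w, cov_x)

def golden_max(h, lo, hi, tol=1e-10):
    """Golden-section search for the maximizer of a unimodal function on [lo, hi]."""
    r = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c, d = hi - r * (hi - lo), lo + r * (hi - lo)
        if h(c) > h(d):
            hi = d
        else:
            lo = c
    return 0.5 * (lo + hi)

def bmpm_parametric(mean_x, cov_x, mean_y, cov_y, beta0, t_max=10.0, max_iter=50):
    """Parametric (Dinkelbach-type) iteration; assumes the starting ratio is non-negative."""
    k_beta = math.sqrt(beta0 / (1.0 - beta0))
    dx, dy = mean_x[0] - mean_y[0], mean_x[1] - mean_y[1]
    norm2 = dx * dx + dy * dy
    w0 = (dx / norm2, dy / norm2)   # satisfies w0^T (mean_x - mean_y) = 1
    direction = (-dy, dx)           # orthogonal to mean_x - mean_y (the role of F)

    def w_of(t):
        return (w0[0] + t * direction[0], w0[1] + t * direction[1])

    w = w0
    lam = ratio(w, cov_x, cov_y, k_beta)
    for _ in range(max_iter):
        # Step 1: maximize f(w) - lam * g(w) along the feasible line.
        def h(t, lam=lam):
            wt = w_of(t)
            return 1.0 - k_beta * quad_sqrt(wt, cov_y) - lam * quad_sqrt(wt, cov_x)
        w = w_of(golden_max(h, -t_max, t_max))
        # Step 2: update lam by f(w)/g(w).
        new_lam = ratio(w, cov_x, cov_y, k_beta)
        converged = abs(new_lam - lam) < 1e-12
        lam = new_lam
        if converged:
            break
    return w, lam

# Illustrative data (assumed): correlated covariances so the optimum is not at w0.
cov_x = [[4.0, 1.0], [1.0, 1.0]]
cov_y = [[1.0, -0.5], [-0.5, 5.0]]
w_star, k_star = bmpm_parametric([3.0, 0.0], cov_x, [-1.0, 0.0], cov_y, beta0=0.5)
alpha_star = k_star ** 2 / (1.0 + k_star ** 2)
```

The returned `k_star` is the optimal $\kappa(\alpha)$, and `alpha_star` is the corresponding worst-case accuracy for class x. The non-negativity assumption on the starting ratio keeps $f(w) - \lambda g(w)$ concave, which is what makes the one-dimensional search valid.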


3.2.6.2 A Geometrical Interpretation for BMPM and MEMPM 

The parametric method actually enables a nice geometrical interpretation of BMPM and MEMPM in a fashion similar to that of MPM in [16]. Similarly, we assume $\bar{x} \neq \bar{y}$ for meaningful classification and also assume that $\Sigma_x$ and $\Sigma_y$ are positive definite for simplicity.

By using the 2-norm definition of a vector $z$: $\|z\|_2 = \max\{u^Tz : \|u\|_2 \le 1\}$, we can express Eq.(3.28) in its dual form:

$$\tau_* := \min_{w}\ \max_{u,\,v,\,\tau}\ \left\{\lambda u^T\Sigma_x^{1/2}w + \kappa(\beta_0)v^T\Sigma_y^{1/2}w + \tau\left(1 - w^T(\bar{x} - \bar{y})\right)\right\}$$

$$\text{s.t.}\quad \|u\|_2 \le 1 , \quad \|v\|_2 \le 1 .$$

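We change the order of the min and max operators and consider the min:

$$\min_{w}\ \left\{\lambda u^T\Sigma_x^{1/2}w + \kappa(\beta_0)v^T\Sigma_y^{1/2}w + \tau\left(1 - w^T(\bar{x} - \bar{y})\right)\right\} = \begin{cases} \tau , & \text{if } \tau\bar{x} - \lambda\Sigma_x^{1/2}u = \tau\bar{y} + \kappa(\beta_0)\Sigma_y^{1/2}v ; \\ -\infty , & \text{otherwise.} \end{cases}$$

Thus, the dual problem can be further changed to:

$$\max_{\tau,\,u,\,v}\ \tau : \quad \|u\|_2 \le 1 ,\ \|v\|_2 \le 1 ,\ \tau\bar{x} - \lambda\Sigma_x^{1/2}u = \tau\bar{y} + \kappa(\beta_0)\Sigma_y^{1/2}v . \qquad (3.30)$$

By defining $\ell := 1/\tau$ we rewrite the dual problem as:

$$\min_{\ell,\,u,\,v}\ \ell : \quad \bar{x} - \lambda\Sigma_x^{1/2}u = \bar{y} + \kappa(\beta_0)\Sigma_y^{1/2}v ,\ \|u\|_2 \le \ell ,\ \|v\|_2 \le \ell . \qquad (3.31)$$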

When the optimum is attained, we have

$$\tau_* = \lambda\|\Sigma_x^{1/2}w_*\|_2 + \kappa(\beta_0)\|\Sigma_y^{1/2}w_*\|_2 = 1/\ell_* . \qquad (3.32)$$

We consider each side of Eq.(3.31) as an ellipsoid centered at the means $\bar{x}$ and $\bar{y}$ and shaped by the weighted covariance matrices $\lambda\Sigma_x$ and $\kappa(\beta_0)\Sigma_y$ respectively:

$$H_x(\ell) = \{x = \bar{x} + \lambda\Sigma_x^{1/2}u : \|u\|_2 \le \ell\} , \qquad (3.33)$$

$$H_y(\ell) = \{y = \bar{y} + \kappa(\beta_0)\Sigma_y^{1/2}v : \|v\|_2 \le \ell\} . \qquad (3.34)$$

The above optimization involves finding a minimum $\ell$ for which the two ellipsoids intersect. For the optimum $\ell$, these two ellipsoids would be tangent to each other. We further note that, according to Lemma 3.4, at the optimum, $\lambda_*$, which is maximized via a series of the above procedures, would satisfy

$$1 = \lambda_*\|\Sigma_x^{1/2}w_*\|_2 + \kappa(\beta_0)\|\Sigma_y^{1/2}w_*\|_2 = \tau_* = 1/\ell_* , \qquad (3.35)$$

$$\Rightarrow\ \ell_* = 1 . \qquad (3.36)$$

This means that the ellipsoid for the class y finally changes to the one centered at $\bar{y}$, whose Mahalanobis distance to $\bar{y}$ is exactly equal to $\kappa(\beta_0)$. Moreover, the ellipsoid for the class x would be the one centered at $\bar{x}$ and tangent to the ellipsoid for the class y. In comparison, for MPM, the two ellipsoids grow at the same speed (with the same $\kappa(\alpha)$ and $\kappa(\beta)$). On the other hand, since MEMPM corresponds to solving a sequence of BMPMs, it similarly leads to a hyperplane tangent to two ellipsoids, which minimizes the maximum of the worst-case Bayes error. Moreover, it is not necessarily attained in a balanced way as in MPM, i.e. the two ellipsoids do not necessarily grow at the same speed and hence probably have unequal Mahalanobis distances from their corresponding centers. This is illustrated in Fig. 3.3.


Fig. 3.3. The geometrical interpretation of MEMPM and BMPM (data: class x depicted as +'s and class y depicted as o's). Finding the optimal BMPM hyperplane corresponds to finding the decision plane (the black dashed line) tangent to an ellipsoid (the inner dashed ellipsoid on the y side), which is centered at $\bar{y}$, shaped by the covariance $\Sigma_y$, and whose Mahalanobis distance to $\bar{y}$ is exactly equal to $\kappa(\beta_0)$ ($\kappa(\beta_0) = 1.28$ in this example). The worst-case accuracy $\alpha$ for x is determined by the Mahalanobis distance $\kappa$ ($\kappa = 5.35$ in this example), at which an ellipsoid (centered at $\bar{x}$ and shaped by $\Sigma_x$) is tangent to that $\kappa(\beta_0)$ ellipsoid, i.e. the outer dashed ellipsoid on the x side. In comparison, MPM tries to find the minimum equality-constrained $\kappa$, at which the two ellipsoids for x and y intersect (both dotted ellipsoids with $\kappa = 2.77$). MEMPM achieves a tangent hyperplane in a non-balanced fashion, i.e. the two ellipsoids may not attain the same $\kappa$ but are globally optimal in the worst-case setting (see the solid ellipsoids)
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3.3 Robust Version 


In the above, the estimates of means and covariance matrices are assumed reliable. We now consider how the probabilistic framework in Eq.(3.2) changes against the variation of the means and covariance matrices:

$$\max_{\alpha,\,\beta,\,w\neq 0,\,b}\ \{\theta\alpha + (1-\theta)\beta\} , \qquad (3.37)$$

$$\text{s.t.}\quad \inf_{x\sim(\bar{x},\Sigma_x)} \Pr\{w^Tx \ge b\} \ge \alpha , \quad \forall(\bar{x},\Sigma_x) \in X , \qquad (3.38)$$

$$\phantom{\text{s.t.}\quad} \inf_{y\sim(\bar{y},\Sigma_y)} \Pr\{w^Ty \le b\} \ge \beta , \quad \forall(\bar{y},\Sigma_y) \in Y , \qquad (3.39)$$

where $X$ and $Y$ are the sets of means and covariance matrices and are subsets of $\mathbb{R}^n \times P_n^+$, where $P_n^+$ is the set of $n \times n$ symmetric positive semidefinite matrices.

Motivated by the tractability of the problem and from the statistical view, a specific setting of $X$ and $Y$ is proposed in [16]. However, they consider the same variations of the means for two classes, which is easy to handle but less general. Now, considering the unequal treatment of each class, we propose the following setting which is in a more general and complete form:

$$X = \left\{(\bar{x}, \Sigma_x) \mid (\bar{x} - \bar{x}^0)^T\Sigma_x^{-1}(\bar{x} - \bar{x}^0) \le \nu_x^2 ,\ \|\Sigma_x - \Sigma_x^0\|_F \le \rho_x\right\} ,$$

$$Y = \left\{(\bar{y}, \Sigma_y) \mid (\bar{y} - \bar{y}^0)^T\Sigma_y^{-1}(\bar{y} - \bar{y}^0) \le \nu_y^2 ,\ \|\Sigma_y - \Sigma_y^0\|_F \le \rho_y\right\} ,$$

where $(\bar{x}^0, \Sigma_x^0)$ and $(\bar{y}^0, \Sigma_y^0)$ are the “nominal” means and covariance matrices obtained through estimation. Parameters $\nu_x$, $\nu_y$, $\rho_x$, and $\rho_y$ are positive constants. The matrix norm is defined as the Frobenius norm: $\|M\|_F^2 = \mathrm{Tr}(M^TM)$.

With the assumption that variations of the means for two classes are the same, the parameters $\nu_x$ and $\nu_y$ are required to be equal in [16]. This may enable the direct usage of the MPM optimization in its robust version. However, the assumption may not be true in real cases. Moreover, in MEMPM, this requirement is neither necessary nor appropriate. This will be demonstrated later in the experiment.

By applying the results from [16], we obtain the robust MEMPM as: 


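$$\max_{\alpha,\,\beta,\,w\neq 0,\,b}\ \{\theta\alpha + (1-\theta)\beta\}$$

$$\text{s.t.}\quad -b + w^T\bar{x}^0 \ge (\kappa(\alpha) + \nu_x)\sqrt{w^T(\Sigma_x^0 + \rho_x I_n)w} ,$$

$$\phantom{\text{s.t.}\quad} b - w^T\bar{y}^0 \ge (\kappa(\beta) + \nu_y)\sqrt{w^T(\Sigma_y^0 + \rho_y I_n)w} .$$

Analogously, we transform the above optimization problem as:

$$\max_{\alpha,\,\beta,\,w\neq 0}\ \theta\,\frac{\kappa_r^2(\alpha)}{1 + \kappa_r^2(\alpha)} + (1-\theta)\beta , \qquad (3.40)$$

$$\text{s.t.}\quad w^T(\bar{x}^0 - \bar{y}^0) = 1 , \qquad (3.41)$$

where

$$\kappa_r(\alpha) = \max\left(\frac{1 - (\kappa(\beta) + \nu_y)\sqrt{w^T(\Sigma_y^0 + \rho_y I_n)w}}{\sqrt{w^T(\Sigma_x^0 + \rho_x I_n)w}} - \nu_x ,\ 0\right) ,$$

and thus can be solved by the SBMPM method. The optimal $b$ is therefore calculated by:

$$b_* = w_*^T\bar{x}^0 - (\kappa(\alpha_*) + \nu_x)\sqrt{w_*^T(\Sigma_x^0 + \rho_x I_n)w_*} = w_*^T\bar{y}^0 + (\kappa(\beta_*) + \nu_y)\sqrt{w_*^T(\Sigma_y^0 + \rho_y I_n)w_*} .$$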
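The effect of the robust parameters can be illustrated by evaluating $\kappa_r(\alpha)$ for a fixed feasible $w$. The sketch below is illustrative only: the function names are our own, the nominal covariances are borrowed from the synthetic example of Section 3.5.1, and the values of $\nu_x$, $\nu_y$, $\rho_x$, $\rho_y$ are arbitrary.

```python
import math

def robust_quad_sqrt(w, cov, rho):
    """sqrt(w^T (cov + rho * I) w)."""
    n = len(w)
    base = sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n))
    return math.sqrt(base + rho * sum(wi * wi for wi in w))

def kappa_robust(w, cov_x0, cov_y0, beta, nu_x, nu_y, rho_x, rho_y):
    """kappa_r(alpha) for a w that satisfies w^T (mean_x0 - mean_y0) = 1."""
    k_beta = math.sqrt(beta / (1.0 - beta))
    num = 1.0 - (k_beta + nu_y) * robust_quad_sqrt(w, cov_y0, rho_y)
    return max(num / robust_quad_sqrt(w, cov_x0, rho_x) - nu_x, 0.0)

cov_x0 = [[4.0, 0.0], [0.0, 1.0]]
cov_y0 = [[1.0, 0.0], [0.0, 5.0]]
w = [0.25, 0.0]  # feasible for nominal means [3, 0] and [-1, 0]

k_nominal = kappa_robust(w, cov_x0, cov_y0, 0.5, 0.0, 0.0, 0.0, 0.0)
k_mean_unc = kappa_robust(w, cov_x0, cov_y0, 0.5, 0.5, 1.0, 0.0, 0.0)
k_cov_unc = kappa_robust(w, cov_x0, cov_y0, 0.5, 0.0, 0.0, 5.0, 3.0)
```

Setting all four robust parameters to zero recovers the nominal $\kappa(\alpha)$; any positive uncertainty in the means or covariances lowers the value and hence the certified worst-case accuracy.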

Remarks. Interestingly, if MPM is treated with unequal robust parameters $\nu_x$ and $\nu_y$, it leads to solving an optimization similar to MEMPM, since $\kappa(\alpha) + \nu_x$ will not be equal to $\kappa(\alpha) + \nu_y$. In addition, similar to the robust MPM, when applied in practice, the specific values of $\nu_x$, $\nu_y$, $\rho_x$ and $\rho_y$ can be provided based on the Central Limit Theorem.


3.4 Kernelization 


We note that, in the above, the classifier derived from MEMPM is given in a linear configuration. In order to handle nonlinear classification problems, in this section we use the kernelization trick [22] to map the $n$-dimensional data points into a high-dimensional feature space $\mathbb{R}^f$, where a linear classifier corresponds to a nonlinear hyperplane in the original space.

Since the optimization of MEMPM corresponds to a sequence of BMPM optimization problems, this model naturally inherits the kernelization ability of BMPM. In the following we therefore mainly address the kernelization of BMPM.

Assuming training data points are represented by $\{x_i\}_{i=1}^{N_x}$ and $\{y_j\}_{j=1}^{N_y}$ for the class x and y, respectively, the kernel mapping can be formulated as:

$$x \to \varphi(x) \sim (\overline{\varphi(x)}, \Sigma_{\varphi(x)}) ,$$

$$y \to \varphi(y) \sim (\overline{\varphi(y)}, \Sigma_{\varphi(y)}) ,$$

where $\varphi : \mathbb{R}^n \to \mathbb{R}^f$ is a mapping function. The corresponding linear classifier in $\mathbb{R}^f$ is $w^T\varphi(z) = b$, where $w, \varphi(z) \in \mathbb{R}^f$, and $b \in \mathbb{R}$. Similarly, the transformed FP optimization in BMPM can be written as:

$$\max_{w\neq 0}\ \frac{1 - \kappa(\beta_0)\sqrt{w^T\Sigma_{\varphi(y)}w}}{\sqrt{w^T\Sigma_{\varphi(x)}w}} \quad \text{s.t.}\quad w^T(\overline{\varphi(x)} - \overline{\varphi(y)}) = 1 . \qquad (3.42)$$

However, to make the kernel work, we need to represent the final decision hyperplane and the optimization in a kernel form, $K(z_1, z_2) = \varphi(z_1)^T\varphi(z_2)$, namely an inner product form of the mapped data points.
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3.4.1 Kernelization Theory for BMPM 

In the following, we demonstrate that although BMPM possesses a significantly different optimization form from MPM, the kernelization theory proposed in [16] is still viable, provided that suitable estimates for means and covariance matrices are applied therein.

We first state a result similar to Corollary 5 of [16] and prove its validity in BMPM.

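Corollary 3.8. If the estimates of means and covariance matrices are given in BMPM as:

$$\overline{\varphi(x)} = \sum_{i=1}^{N_x} \lambda_i\varphi(x_i) , \qquad \overline{\varphi(y)} = \sum_{j=1}^{N_y} \omega_j\varphi(y_j) ,$$

$$\Sigma_{\varphi(x)} = \rho_x I_n + \sum_{i=1}^{N_x} \Lambda_i(\varphi(x_i) - \overline{\varphi(x)})(\varphi(x_i) - \overline{\varphi(x)})^T ,$$

$$\Sigma_{\varphi(y)} = \rho_y I_n + \sum_{j=1}^{N_y} \Omega_j(\varphi(y_j) - \overline{\varphi(y)})(\varphi(y_j) - \overline{\varphi(y)})^T ,$$

where $I_n$ is the identity matrix of dimension $n$, then the optimal $w$ in problem Eq.(3.42) lies in the space spanned by the training points.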

Proof. Similar to Corollary 5 of [16], we write $w = w_p + w_d$, where $w_p$ is the projection of $w$ onto the vector space spanned by all the training data points and $w_d$ is the component orthogonal to this span. It can be easily verified that Eq.(3.42) changes to maximizing the following:

$$\frac{1 - \kappa(\beta_0)\sqrt{w_p^T\sum_{j=1}^{N_y}\Omega_j(\varphi(y_j) - \overline{\varphi(y)})(\varphi(y_j) - \overline{\varphi(y)})^Tw_p + \rho_y(w_p^Tw_p + w_d^Tw_d)}}{\sqrt{w_p^T\sum_{i=1}^{N_x}\Lambda_i(\varphi(x_i) - \overline{\varphi(x)})(\varphi(x_i) - \overline{\varphi(x)})^Tw_p + \rho_x(w_p^Tw_p + w_d^Tw_d)}}$$

subject to the constraint $w_p^T(\overline{\varphi(x)} - \overline{\varphi(y)}) = 1$. Since we intend to maximize the fractional form and both the denominator and the numerator are positive, the denominator needs to be as small as possible and the numerator needs to be as large as possible. This would finally lead to $w_d = 0$. In other words, the optimal $w$ lies in the vector space spanned by all the training data points. Note that the introduction of $\rho_x$ and $\rho_y$ actually enables a direct application of the robust estimates in the kernelization.

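According to Corollary 3.8, if appropriate estimates of means and covariance matrices are applied, the optimal $w$ can be written as a linear combination of training points. In particular, if we obtain the means and covariance matrices as the plug-in estimates, i.e.

$$\overline{\varphi(x)} = \frac{1}{N_x}\sum_{i=1}^{N_x}\varphi(x_i) , \qquad \overline{\varphi(y)} = \frac{1}{N_y}\sum_{j=1}^{N_y}\varphi(y_j) ,$$

$$\Sigma_{\varphi(x)} = \frac{1}{N_x}\sum_{i=1}^{N_x}(\varphi(x_i) - \overline{\varphi(x)})(\varphi(x_i) - \overline{\varphi(x)})^T ,$$

$$\Sigma_{\varphi(y)} = \frac{1}{N_y}\sum_{j=1}^{N_y}(\varphi(y_j) - \overline{\varphi(y)})(\varphi(y_j) - \overline{\varphi(y)})^T ,$$

we can write $w$ as:

$$w = \sum_{i=1}^{N_x}\mu_i\varphi(x_i) + \sum_{j=1}^{N_y}\upsilon_j\varphi(y_j) , \qquad (3.43)$$

where the coefficients $\mu_i, \upsilon_j \in \mathbb{R}$ for $i = 1, \ldots, N_x$ and $j = 1, \ldots, N_y$.

By simply substituting Eq.(3.43) and the four plug-in estimates into Eq.(3.42), we can obtain the Kernelization Theorem of BMPM.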

3.4.2 Notations in Kernelization Theorem of BMPM 

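Before we present the main kernelization result, we first introduce the notations. Let $\{z_i\}_{i=1}^{N}$ denote all $N = N_x + N_y$ data points in the training set where

$$z_i = \begin{cases} x_i , & i = 1, 2, \ldots, N_x , \\ y_{i-N_x} , & i = N_x + 1, N_x + 2, \ldots, N . \end{cases}$$

The element of the Gram matrix $K$ in the position $(i,j)$ is defined as $K_{i,j} = \varphi(z_i)^T\varphi(z_j)$ for $i,j = 1, 2, \ldots, N$. We further define $K_x$ and $K_y$ as the matrices formed by the first $N_x$ rows and the last $N_y$ rows of $K$, respectively, namely,

$$K := \begin{pmatrix} K_x \\ K_y \end{pmatrix} .$$

By setting the row average of the $K_x$ block and the $K_y$ block to zero, the block-row-averaged Gram matrix $\tilde{K}$ is thus obtained:

$$\tilde{K} := \begin{pmatrix} \tilde{K}_x \\ \tilde{K}_y \end{pmatrix} = \begin{pmatrix} K_x - 1_{N_x}\tilde{k}_x^T \\ K_y - 1_{N_y}\tilde{k}_y^T \end{pmatrix} ,$$

where $\tilde{k}_x, \tilde{k}_y \in \mathbb{R}^{N_x+N_y}$ are defined as:

$$[\tilde{k}_x]_i := \frac{1}{N_x}\sum_{j=1}^{N_x} K(x_j, z_i) , \qquad [\tilde{k}_y]_i := \frac{1}{N_y}\sum_{j=1}^{N_y} K(y_j, z_i) .$$

In the above, $1_{N_x} \in \mathbb{R}^{N_x}$ and $1_{N_y} \in \mathbb{R}^{N_y}$ are the all-ones vectors:

$$1_i = 1 ,\ i = 1, 2, \ldots, N_x , \qquad 1_j = 1 ,\ j = 1, 2, \ldots, N_y .$$

Finally, we define the vector formed by the coefficients of $w$ as:

$$w = [\mu_1, \ldots, \mu_{N_x}, \upsilon_1, \upsilon_2, \ldots, \upsilon_{N_y}]^T . \qquad (3.44)$$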
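The notation above can be made concrete with a short sketch that builds the Gram matrix, the vectors $\tilde{k}_x$ and $\tilde{k}_y$, and the two centered blocks. The Gaussian kernel form $\exp(-\|a-b\|^2/(2\sigma^2))$, the toy data points, and all function names are assumptions made for illustration.

```python
import math

def gaussian_kernel(a, b, sigma=1.0):
    """Assumed kernel form: exp(-||a - b||^2 / (2 sigma^2))."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * sigma ** 2))

def block_row_averaged_gram(xs, ys, kernel):
    """Return K, k_x, k_y and the centered blocks K_tilde_x, K_tilde_y."""
    zs = xs + ys
    n_x, n_y, n = len(xs), len(ys), len(zs)
    K = [[kernel(zi, zj) for zj in zs] for zi in zs]
    K_x, K_y = K[:n_x], K[n_x:]
    # [k_x]_i averages K(x_j, z_i) over the class-x points; likewise for k_y.
    k_x = [sum(row[i] for row in K_x) / n_x for i in range(n)]
    k_y = [sum(row[i] for row in K_y) / n_y for i in range(n)]
    # Subtract the block average from every row of the block.
    K_tilde_x = [[row[i] - k_x[i] for i in range(n)] for row in K_x]
    K_tilde_y = [[row[i] - k_y[i] for i in range(n)] for row in K_y]
    return K, k_x, k_y, K_tilde_x, K_tilde_y

xs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # toy class-x points
ys = [[3.0, 3.0], [4.0, 3.0]]               # toy class-y points
K, k_x, k_y, K_tilde_x, K_tilde_y = block_row_averaged_gram(xs, ys, gaussian_kernel)
```

After centering, every column of each block sums to zero, which is what "setting the row average of the block to zero" amounts to.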


3.4.3 Kernelization Results 


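Theorem 3.9. [Kernelization Theorem of BMPM] The optimal decision hyperplane of the problem Eq.(3.42) involves solving the Fractional Programming problem:

$$\kappa(\alpha_*) = \max_{w\neq 0}\ \frac{1 - \kappa(\beta_0)\sqrt{\frac{1}{N_y}w^T\tilde{K}_y^T\tilde{K}_yw}}{\sqrt{\frac{1}{N_x}w^T\tilde{K}_x^T\tilde{K}_xw}} \quad \text{s.t.}\quad w^T(\tilde{k}_x - \tilde{k}_y) = 1 .$$

The intercept $b$ is calculated as:

$$b_* = w_*^T\tilde{k}_x - \kappa(\alpha_*)\sqrt{\frac{1}{N_x}w_*^T\tilde{K}_x^T\tilde{K}_xw_*} = w_*^T\tilde{k}_y + \kappa(\beta_0)\sqrt{\frac{1}{N_y}w_*^T\tilde{K}_y^T\tilde{K}_yw_*} ,$$

where $\kappa(\alpha_*)$ is obtained when the above equation attains its optimum $(w_*, b_*)$. For the robust version of BMPM, we can incorporate the variations of the means and covariances by conducting the following replacements:

$$\frac{1}{N_x}w_*^T\tilde{K}_x^T\tilde{K}_xw_* \ \to\ w_*^T\left(\frac{1}{N_x}\tilde{K}_x^T\tilde{K}_x + \rho_xK\right)w_* ,$$

$$\frac{1}{N_y}w_*^T\tilde{K}_y^T\tilde{K}_yw_* \ \to\ w_*^T\left(\frac{1}{N_y}\tilde{K}_y^T\tilde{K}_y + \rho_yK\right)w_* ,$$

$$\kappa(\beta_0) \ \to\ \kappa(\beta_0) + \nu_y ,$$

$$\kappa(\alpha_*) \ \to\ \kappa(\alpha_*) + \nu_x .$$

The optimal decision hyperplane can be represented as a linear form in the kernel space

$$f(z) = \sum_{i=1}^{N_x} w_{*i}K(z, x_i) + \sum_{i=1}^{N_y} w_{*N_x+i}K(z, y_i) - b_* .$$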
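Evaluating the kernelized decision function is straightforward once the coefficient vector and intercept are available. In the sketch below the coefficients, the intercept, and the training points are hypothetical placeholders rather than the output of an actual BMPM solve, and a linear kernel is used so that the result can be checked by hand.

```python
def linear_kernel(a, b):
    """Plain inner product, used here so the result is easy to verify."""
    return sum(ai * bi for ai, bi in zip(a, b))

def decision_value(z, xs, ys, coeffs, b, kernel):
    """f(z) = sum_i w_i K(z, x_i) + sum_i w_{N_x + i} K(z, y_i) - b."""
    n_x = len(xs)
    total = sum(coeffs[i] * kernel(z, xs[i]) for i in range(n_x))
    total += sum(coeffs[n_x + i] * kernel(z, ys[i]) for i in range(len(ys)))
    return total - b

# Hypothetical training points, coefficients and intercept.
xs = [[1.0, 0.0], [2.0, 0.0]]
ys = [[-1.0, 0.0]]
coeffs = [0.5, 0.25, -0.5]
b = 0.3

value = decision_value([2.0, 1.0], xs, ys, coeffs, b, linear_kernel)
label = "x" if value >= 0.0 else "y"
```

Since class x is the one constrained by $w^Tx \ge b$, a non-negative $f(z)$ assigns $z$ to class x and a negative value assigns it to class y.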
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3.5 Experiments 

In this section, we first evaluate our model on a synthetic dataset. Then we compare the performance of MEMPM with that of MPM on six real-world benchmark datasets (since MPM is reported to be comparable to SVM, we do not perform comparisons with SVM). To demonstrate that BMPM is ideal for imposing a specified bias in classification, we also implement it on the Heart-disease dataset. The means and covariance matrices for the two classes are obtained directly from the training datasets by plug-in estimations. The prior probability $\theta$ is given by the proportion of x data in the training dataset.


3.5.1 Model Illustration on a Synthetic Dataset 

To verify that the MEMPM model achieves the minimum Bayes error rate under the Gaussian distribution, we synthetically generate two classes of two-dimensional Gaussian data. As plotted in Fig. 3.4(a), data associated with the class x are generated with the mean $\bar{x}$ as $[3, 0]^T$ and the covariance matrix $\Sigma_x$ as $[4, 0; 0, 1]$, while data associated with the class y are generated with the mean $\bar{y}$ as $[-1, 0]^T$ and the covariance matrix $\Sigma_y$ as $[1, 0; 0, 5]$. The solved decision hyperplane $z_1 = 0.333$ given by MPM is plotted as the solid line and the solved decision hyperplane $z_1 = 0.660$ given by MEMPM is plotted as the dashed line. From the geometrical interpretation, both hyperplanes should be perpendicular to the $z_1$ axis.

As shown in Fig. 3.4(b), the MEMPM hyperplane exactly represents the optimal thresholding under the distributions of the first dimension for the two classes of data, i.e. the intersection point of the two density functions. On the other hand, we find that the MPM hyperplane exactly corresponds to the thresholding point with the same error rate for the two classes of data, since the cumulative distributions $P_x(z_1 < 0.333)$ and $P_y(z_1 > 0.333)$ are exactly the same.
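Both thresholds can be reproduced from the first-dimension marginals alone. The following sketch assumes equal priors ($\theta = 0.5$, which is not stated explicitly for this experiment) and uses the Gaussian form of the MEMPM objective; the golden-section search and the function names are our own illustrative choices.

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def golden_max(h, lo, hi, tol=1e-10):
    """Golden-section search for the maximizer of a unimodal function on [lo, hi]."""
    r = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c, d = hi - r * (hi - lo), lo + r * (hi - lo)
        if h(c) > h(d):
            hi = d
        else:
            lo = c
    return 0.5 * (lo + hi)

# First-dimension marginals of the synthetic data: x ~ N(3, 4), y ~ N(-1, 1).
mean_x, sd_x = 3.0, 2.0
mean_y, sd_y = -1.0, 1.0
theta = 0.5  # assumed equal priors

def accuracy(b):
    """theta * Pr{x >= b} + (1 - theta) * Pr{y <= b} under the Gaussian marginals."""
    return theta * phi((mean_x - b) / sd_x) + (1.0 - theta) * phi((b - mean_y) / sd_y)

# MEMPM with the Gaussian assumption: maximize the accuracy over the threshold b.
b_mempm = golden_max(accuracy, mean_y, mean_x)

# MPM: both classes share the same kappa, so the mean gap equals kappa * (sd_x + sd_y).
kappa_mpm = (mean_x - mean_y) / (sd_x + sd_y)
b_mpm = mean_y + kappa_mpm * sd_y

err_x = phi((b_mpm - mean_x) / sd_x)        # Pr{x < b_mpm}
err_y = 1.0 - phi((b_mpm - mean_y) / sd_y)  # Pr{y > b_mpm}
```

Under these assumptions the two thresholds agree with the values $z_1 = 0.660$ and $z_1 = 0.333$ quoted above, and the two error rates at the MPM threshold coincide.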

3.5.2 Evaluations on Benchmark Datasets 

We next evaluate our algorithm on six benchmark datasets. Data for the Twonorm problem were generated according to [4]. The other five datasets, namely the Breast, Ionosphere, Pima, Heart-disease, and Vote data, were obtained from the UCI machine learning repository [3]. Since handling missing attribute values is out of the scope of this chapter, we simply remove instances with missing attribute values from these datasets.

We randomly partition the data into 90% training and 10% test sets. The final results are averaged over 50 random partitions of the data. We compare the performance of MEMPM and MPM in both the linear setting and the Gaussian kernel setting. The width parameter ($\sigma$) for the Gaussian kernel is obtained via cross validation over 50 random partitions of the training set. The experimental results are summarized in Tables 3.1 and 3.2 for the linear kernel and the Gaussian kernel respectively.

Fig. 3.4. An evaluation of MEMPM and MPM on a synthetic dataset. Panel (b) shows the optimal thresholding point and the MEMPM decision plane. The decision hyperplane derived from MEMPM (the dashed line) exactly corresponds to the optimal thresholding point, i.e. the intersection point, while the decision hyperplane given by MPM (the solid line) corresponds to the point at which the two error rates for the two classes of data are equal

From the results we can see that our MEMPM demonstrates better performance than MPM in both the linear and Gaussian kernel settings. Moreover, as observed in these benchmark datasets, the MEMPM hyperplanes are obtained with significantly unequal $\alpha$ and $\beta$ except in the Twonorm set. This further confirms the validity of our proposition, i.e. the optimal minimax machine is not certain to achieve the same worst-case accuracies for two classes. The Twonorm set is not an exception either. The two classes of data in this set are generated under multivariate normal distributions with the same covariance matrices. In this special case, the intersection point of the two density functions will exactly represent the optimal thresholding point and the one with the same error rate for each class as well. Another important finding is that the accuracy bounds, namely $\theta\alpha + (1-\theta)\beta$ in MEMPM and $\alpha$ in MPM, are all increased in the Gaussian kernel setting when compared with those in the linear setting. This shows the advantage of the kernelized probability machine over the linear probability machine.

Table 3.1. Lower bound $\alpha$, $\beta$, and test accuracy compared to MPM in the linear setting (all values in %)

Dataset       | MEMPM α    | MEMPM β    | MEMPM θα+(1-θ)β | MEMPM Accuracy | MPM α      | MPM Accuracy
Twonorm       | 80.3 ± 0.2 | 79.9 ± 0.1 | 80.1 ± 0.1      | 97.9 ± 0.1     | 80.1 ± 0.1 | 97.9 ± 0.1
Breast        | 77.8 ± 0.8 | 91.4 ± 0.5 | 86.7 ± 0.5      | 96.9 ± 0.3     | 84.4 ± 0.5 | 97.0 ± 0.2
Ionosphere    | 95.9 ± 1.2 | 36.5 ± 2.6 | 74.5 ± 0.8      | 88.5 ± 1.0     | 63.4 ± 1.1 | 84.8 ± 0.8
Pima          | 0.9 ± 0.0  | 62.9 ± 1.1 | 41.3 ± 0.8      | 76.8 ± 0.6     | 32.0 ± 0.8 | 76.1 ± 0.6
Heart-disease | 43.6 ± 2.5 | 66.5 ± 1.5 | 56.3 ± 1.4      | 84.2 ± 0.7     | 54.9 ± 1.4 | 83.2 ± 0.8
Vote          | 82.6 ± 1.3 | 84.6 ± 0.7 | 83.9 ± 0.9      | 94.9 ± 0.4     | 83.8 ± 0.9 | 94.8 ± 0.4

Table 3.2. Lower bound $\alpha$, $\beta$, and test accuracy compared to MPM in the Gaussian kernel (all values in %)

Dataset       | MEMPM α    | MEMPM β    | MEMPM θα+(1-θ)β | MEMPM Accuracy | MPM α      | MPM Accuracy
Twonorm       | 91.7 ± 0.2 | 91.7 ± 0.2 | 91.7 ± 0.2      | 97.9 ± 0.1     | 91.7 ± 0.2 | 97.9 ± 0.1
Breast        | 88.4 ± 0.6 | 90.7 ± 0.4 | 89.9 ± 0.4      | 96.9 ± 0.2     | 89.9 ± 0.4 | 96.9 ± 0.3
Ionosphere    | 94.2 ± 0.8 | 80.9 ± 3.0 | 89.4 ± 0.8      | 93.8 ± 0.4     | 89.0 ± 0.8 | 92.2 ± 0.4
Pima          | 2.6 ± 0.1  | 62.3 ± 1.6 | 41.4 ± 1.1      | 77.0 ± 0.7     | 32.1 ± 1.0 | 76.2 ± 0.6
Heart-disease | 47.1 ± 2.2 | 66.6 ± 1.4 | 58.0 ± 1.5      | 83.9 ± 0.9     | 57.4 ± 1.6 | 83.1 ± 1.0
Vote          | 85.1 ± 1.3 | 84.3 ± 0.7 | 84.7 ± 0.8      | 94.7 ± 0.5     | 84.4 ± 0.8 | 94.6 ± 0.4

In addition, to clearly see the relationship between the bounds and the
test set accuracies (TSA), we plot them in Fig. 3.5. As observed, the test
set accuracies, including $TSA_x$ (for the class x), $TSA_y$ (for the class y), and
the overall accuracies TSA, are all greater than their corresponding accuracy
bounds in both MPM and MEMPM. This demonstrates how the accuracy
bound can serve as a performance indicator on future data.
(a) $\alpha$'s and $TSA_x$'s in the linear kernel
(b) $\alpha$'s and $TSA_x$'s in the Gaussian kernel
(c) $\beta$'s and $TSA_y$'s in the linear kernel
(d) $\beta$'s and $TSA_y$'s in the Gaussian kernel
(e) Bounds and TSA's in the linear kernel
(f) Bounds and TSA's in the Gaussian kernel

Fig. 3.5. Empirical evaluations on bounds and test set accuracies of MEMPM. The
test accuracies, including $TSA_x$ (for the class x), $TSA_y$ (for the class y), and the
overall accuracies TSA, are all greater than their corresponding accuracy bounds
in both MPM and MEMPM. This demonstrates how the accuracy bound can serve
as a performance indicator on future data


It is also observed that the overall worst-case accuracies $\theta\alpha + (1-\theta)\beta$
in MEMPM are greater than or equal to $\alpha$ in MPM in both the linear and
Gaussian settings. This again demonstrates the advantages of MEMPM over MPM.
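
The comparison above can be re-checked directly from the mean values in Tables 3.1 and 3.2. The following is a small sketch, not part of the original experiments: the helper names `bound_gap` and `implied_theta` are introduced here for illustration, and the prior $\theta$ recovered by `implied_theta` is only a rough back-calculation, because every table entry is an average over repeated runs rounded to one decimal.

```python
# Mean values (in %) copied from Tables 3.1 and 3.2; the "± std" parts are omitted.
# Each row: (alpha, beta, theta*alpha + (1 - theta)*beta) for MEMPM, then alpha for MPM.
LINEAR = {
    "Twonorm": (80.3, 79.9, 80.1, 80.1),
    "Breast": (77.8, 91.4, 86.7, 84.4),
    "Ionosphere": (95.9, 36.5, 74.5, 63.4),
    "Pima": (0.9, 62.9, 41.3, 32.0),
    "Heart-disease": (43.6, 66.5, 56.3, 54.9),
    "Vote": (82.6, 84.6, 83.9, 83.8),
}
GAUSSIAN = {
    "Twonorm": (91.7, 91.7, 91.7, 91.7),
    "Breast": (88.4, 90.7, 89.9, 89.9),
    "Ionosphere": (94.2, 80.9, 89.4, 89.0),
    "Pima": (2.6, 62.3, 41.4, 32.1),
    "Heart-disease": (47.1, 66.6, 58.0, 57.4),
    "Vote": (85.1, 84.3, 84.7, 84.4),
}


def bound_gap(row):
    """MEMPM overall bound minus MPM bound, in percentage points."""
    _, _, mempm_bound, mpm_bound = row
    return round(mempm_bound - mpm_bound, 1)


def implied_theta(row):
    """Solve bound = theta*alpha + (1 - theta)*beta for theta (None if alpha == beta)."""
    alpha, beta, mempm_bound, _ = row
    if alpha == beta:
        return None
    return (mempm_bound - beta) / (alpha - beta)


for setting, table in (("linear", LINEAR), ("Gaussian", GAUSSIAN)):
    for name, row in table.items():
        print(setting, name, bound_gap(row), implied_theta(row))
```

In every row the gap is non-negative; it is zero for Twonorm (both settings) and for Breast in the Gaussian setting, where the two bounds coincide to one decimal.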
Since the lower bounds agree well with the test accuracies in the above
experimental results, we do not run the robust versions of the two models on
the real-world datasets. To see how the robust version works, we generate two
classes of Gaussian data. As illustrated in Fig. 3.6, the x data are sampled
(b) Robust MPM and MEMPM with $\nu_x = \nu_y$

Fig. 3.6. An example in $\mathbb{R}^2$ demonstrating the results of the robust versions of
MEMPM and MPM. Training points are indicated with black +'s for
class x and magenta □'s for class y. Test points are represented by blue ×'s
for class x and by green o's for class y. In (a), the robust MEMPM
outperforms both MEMPM and the robust MPM. In (b), the robust MEMPM
with $\nu_x \neq \nu_y$ outperforms the robust MEMPM with $\nu_x = \nu_y$.


from the Gaussian distribution with mean $[3, 0]^T$ and covariance
[1 0; 0 3], while the y data are sampled from another Gaussian distribution
with mean $[-3, 0]^T$ and covariance [3 0; 0 1]. We randomly select
10 points of each class for training and leave the remaining points of the
above synthetic dataset for testing. We present the results in the following.

First, we calculate the corresponding means $\bar{x}^0$ and $\bar{y}^0$ and covariance
matrices $\Sigma_x^0$ and $\Sigma_y^0$, and plug them into the linear MPM and the linear MEMPM.
We obtain the MPM decision line (dotted line) with a lower bound (assuming
the Gaussian distribution) of 99.1% and the MEMPM decision line (dash-dot
line) with a lower bound of 99.7%. However, on the test set
we only obtain accuracies of 93.0% for MPM and 97.0% for MEMPM (see
Fig. 3.6(a)). This obviously violates the lower bounds.

Based on our knowledge of the real means and covariance matrices in this 
example, we set the parameters as 

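$$
\begin{aligned}
\nu_x &= \sqrt{(\bar{x} - \bar{x}^0)^T \Sigma_x^{-1} (\bar{x} - \bar{x}^0)} = 0.046 , \\
\nu_y &= \sqrt{(\bar{y} - \bar{y}^0)^T \Sigma_y^{-1} (\bar{y} - \bar{y}^0)} = 0.496 , \\
\rho_x &= \|\Sigma_x - \Sigma_x^0\|_F = 1.561 , \\
\rho_y &= \|\Sigma_y - \Sigma_y^0\|_F = 0.972 , \\
\nu &= \max(\nu_x, \nu_y) .
\end{aligned}
$$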
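
As a rough illustration of how these uncertainty parameters are obtained, the sketch below draws 10 training points per class from the two Gaussians described above and evaluates the same four formulas. The random seed, the use of the unbiased sample covariance (`np.cov`), and all function names are assumptions made here; the resulting numbers will therefore differ from the values 0.046, 0.496, 1.561 and 0.972 reported in the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility

# True parameters of the two classes, as stated in the text.
MEAN_X, COV_X = np.array([3.0, 0.0]), np.array([[1.0, 0.0], [0.0, 3.0]])
MEAN_Y, COV_Y = np.array([-3.0, 0.0]), np.array([[3.0, 0.0], [0.0, 1.0]])
N_TRAIN = 10  # training points per class, as in the text


def plug_in_estimates(samples):
    """Sample mean and (unbiased) sample covariance used as plug-in estimates."""
    return samples.mean(axis=0), np.cov(samples, rowvar=False)


def mean_uncertainty(true_mean, est_mean, true_cov):
    """nu = sqrt((mean - mean0)^T Sigma^{-1} (mean - mean0))."""
    diff = true_mean - est_mean
    return float(np.sqrt(diff @ np.linalg.solve(true_cov, diff)))


def cov_uncertainty(true_cov, est_cov):
    """rho = Frobenius norm of (Sigma - Sigma0)."""
    return float(np.linalg.norm(true_cov - est_cov, ord="fro"))


train_x = rng.multivariate_normal(MEAN_X, COV_X, size=N_TRAIN)
train_y = rng.multivariate_normal(MEAN_Y, COV_Y, size=N_TRAIN)
mean_x0, cov_x0 = plug_in_estimates(train_x)
mean_y0, cov_y0 = plug_in_estimates(train_y)

nu_x = mean_uncertainty(MEAN_X, mean_x0, COV_X)
nu_y = mean_uncertainty(MEAN_Y, mean_y0, COV_Y)
rho_x = cov_uncertainty(COV_X, cov_x0)
rho_y = cov_uncertainty(COV_Y, cov_y0)
nu = max(nu_x, nu_y)  # common value used by the robust MPM
print(nu_x, nu_y, rho_x, rho_y, nu)
```

In practice the true moments are unknown, so these parameters would have to be chosen by other means; computing them from the true moments is possible here only because the data are synthetic.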

We then train the robust linear MPM and the robust linear MEMPM with
these parameters and obtain the robust MPM decision line (dashed line) and the
robust MEMPM decision line (solid line), as seen in Fig. 3.6(a). The lower
bounds decrease to 87.3% for MPM and 93.2% for MEMPM respectively,
but the test accuracies increase to 98.0% for MPM and 100.0% for MEMPM.
The lower bounds now agree with the test accuracies.

Note that in the above, the robust MEMPM also achieves better performance
than the robust MPM. Moreover, $\nu_x$ and $\nu_y$ are not necessarily
the same. To see the result of MEMPM when $\nu_x$ and $\nu_y$ are forced to be
the same, we train the robust MEMPM again by setting the parameters as
$\nu_x = \nu_y = \nu$ as used in MPM. We obtain the corresponding decision line
(dash-dot line) as seen in Fig. 3.6(b). The lower bound decreases to 91.0%
and the test accuracy decreases to 98.0%. The above example indicates how
the robust MEMPM clearly improves over the standard MEMPM when a
bias is introduced by inaccurate plug-in estimates, and also validates
that $\nu_x$ need not be equal to $\nu_y$.

3.5.3 Evaluations of BMPM on Heart-disease Dataset 

To demonstrate the advantages of the BMPM model in dealing with biased
classifications, we implement BMPM on the Heart-disease dataset, where
different treatments for different classes are necessary. The x class is
associated with subjects with heart disease, whereas the y class corresponds to
subjects without heart disease. Obviously, a bias should be considered for x,
since misclassifying an x case into the opposite class would delay the therapy
and is riskier than the other way round. As before, we randomly partition
the data into 90% training and 10% test sets. Also, the width parameter
($\sigma$) for the Gaussian kernel is obtained via cross validation over 50 random
partitions of the training set. We repeat the above procedure 50 times and
report the average results.

By intentionally varying $\beta_0$ from 0 to 1, we obtain a series of test
accuracies, including the x accuracy $TSA_x$ and the y accuracy $TSA_y$, for both the
linear and Gaussian kernels. For simplicity, we denote the x accuracy in the
linear setting as $TSA_x(L)$, while the others are similarly defined.

The results are summarized in Fig. 3.7. Four observations are worth
highlighting. First, in both the linear and Gaussian kernel settings, the smaller
$\beta_0$, the higher the test accuracy for x. This indicates that a bias can indeed be
embedded in the classification boundary for the important class x by specifying a
relatively smaller $\beta_0$. In comparison, MPM forces an equal treatment of each
class and thus is not suitable for biased classification. Second, the test
accuracies for y and x are strictly lower bounded by $\beta_0$ and $\alpha$. This shows how a bias
can be quantitatively, directly and rigorously imposed towards the important
class x. Note again that, for other weight-adapting-based biased classifiers,
the weights themselves lack accurate interpretations and thus cannot
rigorously impose a specified bias, i.e. they would have to try different weights
for a specified bias. Third, for a prescribed $\beta_0$, the test accuracy for x and
its worst-case accuracy $\alpha$ in the Gaussian kernel setting are both higher
than the corresponding accuracies in the linear setting. Once again,
this demonstrates the power of kernelization. Fourth, we note that $\beta_0$
actually has an upper bound, which is around 90% for the linear BMPM
on this dataset. This is reasonable. As observed from Eq.(3.11), the maximum
$\beta_0$, denoted as $\beta_{0\max}$, is decided by setting $\alpha = 0$, i.e.

$$\kappa(\beta_{0\max}) = \max_{w \neq 0} \frac{1}{\sqrt{w^T \Sigma_y w}} , \quad \text{s.t.} \quad w^T(\bar{x} - \bar{y}) = 1 . \tag{3.45}$$
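
Problem (3.45) has a closed-form solution, which the following sketch evaluates. Maximizing $1/\sqrt{w^T \Sigma_y w}$ under the linear constraint is the same as minimizing $w^T \Sigma_y w$, and a Lagrange-multiplier argument gives $w^* = \Sigma_y^{-1} d / (d^T \Sigma_y^{-1} d)$ with $d = \bar{x} - \bar{y}$, so that $\kappa(\beta_{0\max})^2 = d^T \Sigma_y^{-1} d$. The code assumes the usual MPM relation $\kappa(\beta) = \sqrt{\beta/(1-\beta)}$; the function name `beta0_max` and the example moments (taken from the synthetic data of Section 3.5.2, not from Heart-disease) are choices made here for illustration.

```python
import numpy as np


def beta0_max(mean_x, mean_y, cov_y):
    """Closed-form solution of (3.45): returns (beta_0max, optimal w)."""
    d = np.asarray(mean_x, float) - np.asarray(mean_y, float)
    sol = np.linalg.solve(cov_y, d)      # Sigma_y^{-1} d
    m = float(d @ sol)                   # kappa(beta_0max)^2 = d^T Sigma_y^{-1} d
    w_star = sol / m                     # satisfies w^T d = 1
    return m / (1.0 + m), w_star         # invert kappa(beta) = sqrt(beta / (1 - beta))


# Example with the moments of the synthetic Gaussian data used earlier.
beta_max, w_star = beta0_max([3.0, 0.0], [-3.0, 0.0], np.array([[3.0, 0.0], [0.0, 1.0]]))
print(beta_max, w_star)
```

For these example moments the squared Mahalanobis distance between the class means is 12, so the largest admissible $\beta_0$ is 12/13.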

It is interesting to note that when $\beta_0$ is set to zero, the test accuracies for
y in the linear and Gaussian settings are both around 50% (see Fig. 3.7(b)).
This seeming “irrationality” is actually reasonable. We will discuss this in
the next section.


3.6 How Tight Is the Bound? 


A natural question for MEMPM is how tight the worst-case bound is. In this
section, we present a theoretical analysis addressing this problem.

In Marshall and Olkin Theory, if we define $S = \{w^T y \geq b\}$, the theorem
becomes:

$$\sup_{y \sim \{\bar{y}, \Sigma_y\}} \Pr\{w^T y \geq b\} = \frac{1}{1 + d^2} , \quad \text{with} \quad d^2 = \inf_{w^T y \geq b} (y - \bar{y})^T \Sigma_y^{-1} (y - \bar{y}) .$$
Fig. 3.7. Bounds and real accuracies for BMPM on the Heart-disease dataset.
With $\beta_0$ varying from 0 to 1, the real accuracies are lower bounded by the
worst-case accuracies. In addition, $\alpha(G)$ is above $\alpha(L)$, which shows the
power of kernelization


Looking into the above equation and Eq.(3.4), for a given hyperplane
$\{w, b\}$ we can easily obtain:

$$\beta = \frac{d^2}{1 + d^2} . \tag{3.46}$$


Moreover, in [16], a simple closed-form expression for the minimum
distance d is derived:

$$d^2 = \inf_{w^T y \geq b} (y - \bar{y})^T \Sigma_y^{-1} (y - \bar{y}) = \frac{\max((b - w^T \bar{y}), 0)^2}{w^T \Sigma_y w} . \tag{3.47}$$
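
Eqs. (3.46) and (3.47) together map a hyperplane and the class moments to a worst-case accuracy. The sketch below evaluates them and compares the closed form with a brute-force minimum over random points on the boundary $w^T y = b$. It is an illustration only: the function names are introduced here, the brute-force check is restricted to two dimensions, and the example moments are those of the synthetic y class from Section 3.5.2.

```python
import numpy as np


def worst_case_accuracy(w, b, mean_y, cov_y):
    """Return (d^2, beta) from Eqs. (3.47) and (3.46); class y is the side w^T z < b."""
    w = np.asarray(w, float)
    d2 = max(b - w @ mean_y, 0.0) ** 2 / (w @ cov_y @ w)
    return d2, d2 / (1.0 + d2)


def brute_force_d2(w, b, mean_y, cov_y, n_samples=20000, seed=0):
    """Smallest squared Mahalanobis distance over random points on w^T y = b (2-D only)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w, float)
    anchor = b * w / (w @ w)               # one point on the line w^T y = b
    tangent = np.array([-w[1], w[0]])      # direction along the line
    ts = rng.uniform(-10.0, 10.0, size=n_samples)
    diffs = anchor + ts[:, None] * tangent - mean_y
    d2 = np.einsum("ij,jk,ik->i", diffs, np.linalg.inv(cov_y), diffs)
    return float(d2.min())


mean_y = np.array([-3.0, 0.0])
cov_y = np.array([[3.0, 0.0], [0.0, 1.0]])
print(worst_case_accuracy([1.0, 0.0], 0.0, mean_y, cov_y))    # hyperplane z_1 = 0
print(worst_case_accuracy([1.0, 0.0], -3.0, mean_y, cov_y))   # hyperplane through the mean
print(worst_case_accuracy([1.0, 1.0], 0.0, mean_y, cov_y),
      brute_force_d2([1.0, 1.0], 0.0, mean_y, cov_y))
```

The second call shows the degenerate case discussed next in the text: a hyperplane through $\bar{y}$ gives $d = 0$ and hence a worst-case accuracy of zero.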


It is easy to see that when the decision hyperplane $(w, b)$ passes through the center
$\bar{y}$, $d$ would be equal to 0 and the worst-case accuracy $\beta$ would be 0 according
to Eq.(3.46).

However, if we consider the Gaussian data (which we take as the y data)
in Fig. 3.8, a vertical line passing approximately through $\bar{y}$ would achieve about
50% test accuracy. The large gap between the worst-case accuracy and the real test
accuracy seems strange. In the following, we construct an example of
one-dimensional data to show the inner rationality of this observation. We
attempt to provide the worst-case distribution with the given mean and
covariance, for which a hyperplane passing through its mean achieves a real test
accuracy of zero.



[Plot; axis label: worst-case accuracy $\beta$]

Fig. 3.8. Theoretical comparison between the worst-case accuracy and the
real test accuracy for the Gaussian data in Fig. 3.10(a)


Consider the one-dimensional data y consisting of $N-1$ observations with
value $m$ and one single observation with value $\sigma\sqrt{N} + m$. If we
calculate the mean and the covariance, we obtain:

$$\bar{y} = m + \frac{\sigma}{\sqrt{N}} , \qquad \Sigma_y = \frac{N-1}{N}\,\sigma^2 .$$


When N goes to infinity, the above one-dimensional data have mean $m$
and covariance $\sigma^2$. In this extreme case, a hyperplane passing through the mean
will achieve zero test accuracy, which is exactly the worst-case accuracy
given the fixed mean and covariance $m$ and $\sigma^2$ respectively. This example
demonstrates the inner rationality of the minimax probability machines.
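
The construction can be checked numerically. The sketch below builds the sample explicitly, computes its mean and population variance (normalizing by $N$), and measures the accuracy of a threshold placed at the sample mean. The orientation of the threshold (class y is predicted only above the mean) is the adversarial choice assumed here, and the function names are introduced for this illustration.

```python
import math


def extreme_sample(m, sigma, n):
    """n - 1 observations at m plus one observation at m + sigma * sqrt(n)."""
    return [m] * (n - 1) + [m + sigma * math.sqrt(n)]


def extreme_sample_moments(m, sigma, n):
    """Mean and population variance (normalized by n) of the constructed sample."""
    values = extreme_sample(m, sigma, n)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var


def accuracy_of_threshold_at_mean(m, sigma, n):
    """Fraction of y points classified as y when y is predicted only for z > mean."""
    values = extreme_sample(m, sigma, n)
    mean = sum(values) / n
    return sum(v > mean for v in values) / n


for n in (10, 1_000, 100_000):
    mean, var = extreme_sample_moments(0.0, 2.0, n)
    print(n, mean, var, accuracy_of_threshold_at_mean(0.0, 2.0, n))
```

As $N$ grows, the mean tends to $m$, the variance tends to $\sigma^2$, and the accuracy $1/N$ of the threshold at the mean tends to zero.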

To further examine the tightness of the worst-case bound in Fig. 3.9(a),
we vary $\beta$ from 0 to 1 and plot against $\beta$ the real test accuracy that a vertical





(a) Gaussian data
(b) Gaussian data skewed to the left side
(c) Gaussian data skewed to the right side

Fig. 3.9. Three two-dimensional data sets with the same means and covariances but
with different skewness. The worst-case accuracy bound of (a) is tighter than that
of (b) and looser than that of (c)


line classifies the y data by using Eq.(3.46). Note that the real accuracy can
be calculated as $\Phi(z \leq d)$. This curve is plotted in Fig. 3.10.
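
The curve described above can be tabulated with a few lines of code. This sketch assumes Gaussian y data, so that the real accuracy is the standard normal distribution function evaluated at $d$, and inverts Eq. (3.46) as $d = \sqrt{\beta/(1-\beta)}$; the function names are introduced here for illustration.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def real_accuracy_from_bound(beta):
    """Invert Eq. (3.46) to get d from beta, then return the Gaussian accuracy Phi(d)."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    d = math.sqrt(beta / (1.0 - beta))
    return normal_cdf(d)


for beta in (0.0, 0.25, 0.5, 0.75, 0.9):
    real = real_accuracy_from_bound(beta)
    print(f"beta={beta:.2f}  real={real:.4f}  gap={real - beta:.4f}")
```

At $\beta = 0$ the real accuracy is 0.5, which is the 50% gap discussed above; the gap shrinks as $\beta$ grows.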



Fig. 3.10. Three two-dimensional data sets with the same means and
covariances but with different skewness. The worst-case accuracy bound of (a) is
tighter than that of (b) and looser than that of (c)


As observed from Fig. 3.9, the smaller the worst-case accuracy, the looser it
is. On the other hand, if we skew the y data towards the left side while
simultaneously keeping the mean and covariance unchanged (see Fig. 3.9(b)),
an even bigger gap will be generated when $\beta$ is small; analogously, if we skew
the data towards the right side (see Fig. 3.9(c)), a tighter accuracy bound can
be expected. This finding means that using only moments up to the second
order may not achieve a satisfactory bound. In other words, for a
tighter bound, higher order moments such as skewness need to be
considered. This problem of estimating a probability bound based on moments is
known as the $(n, k, \Omega)$-bound problem, which means “finding the tightest
bound for an $n$-dimensional variable in the set $\Omega$ based on up to the $k$-th
moments.” Unfortunately, as proved in [24], the $(n, k, \mathbb{R}^n)$-bound
problem is NP-hard for $k \geq 3$. Thus tightening the bound by simply scaling up the
moment order may be intractable in this sense. We may have to exploit other
statistical techniques to achieve this goal. Certainly, this deserves a closer
examination in the future.


3.7 On the Concavity of MEMPM 


We address the issue of the concavity of the MEMPM model in this
section. We will demonstrate that although MEMPM cannot generally
guarantee concavity, there is strong empirical evidence that many
real-world problems exhibit reasonable concavity in MEMPM. Hence, the
MEMPM model can be solved successfully by standard optimization
methods, e.g. the line search method proposed in this chapter.

We first present a lemma on BMPM. 

Lemma 3.10. The optimal solution for BMPM is a strictly and monotonically
decreasing function with respect to $\beta_0$.

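Proof. Let the corresponding optimal worst-case accuracies on x be $\alpha_1$ and
$\alpha_2$ respectively, when $\beta_{01}$ and $\beta_{02}$ are set as the acceptable accuracy levels
for y in BMPM. We will prove that if $\beta_{01} > \beta_{02}$, then $\alpha_1 < \alpha_2$.

This can be proved by considering the contrary case, i.e. we assume
$\alpha_1 \geq \alpha_2$. From the problem definition of BMPM, we have:

$$\alpha_1 \geq \alpha_2 \;\Longrightarrow\; \kappa(\alpha_1) \geq \kappa(\alpha_2) \;\Longrightarrow\; \frac{1 - \kappa(\beta_{01})\sqrt{w_1^T \Sigma_y w_1}}{\sqrt{w_1^T \Sigma_x w_1}} \geq \frac{1 - \kappa(\beta_{02})\sqrt{w_2^T \Sigma_y w_2}}{\sqrt{w_2^T \Sigma_x w_2}} , \tag{3.48}$$

where $w_1$ and $w_2$ are the corresponding optimal solutions which maximize
$\kappa(\alpha_1)$ and $\kappa(\alpha_2)$ respectively, when $\beta_{01}$ and $\beta_{02}$ are specified.

From $\beta_{01} > \beta_{02}$ and Eq.(3.48), we have

$$\frac{1 - \kappa(\beta_{02})\sqrt{w_1^T \Sigma_y w_1}}{\sqrt{w_1^T \Sigma_x w_1}} > \frac{1 - \kappa(\beta_{01})\sqrt{w_1^T \Sigma_y w_1}}{\sqrt{w_1^T \Sigma_x w_1}} \geq \frac{1 - \kappa(\beta_{02})\sqrt{w_2^T \Sigma_y w_2}}{\sqrt{w_2^T \Sigma_x w_2}} . \tag{3.50}$$

On the other hand, since $w_2$ is the optimal solution of

$$\max_{w \neq 0} \frac{1 - \kappa(\beta_{02})\sqrt{w^T \Sigma_y w}}{\sqrt{w^T \Sigma_x w}} ,$$

we have

$$\frac{1 - \kappa(\beta_{02})\sqrt{w_2^T \Sigma_y w_2}}{\sqrt{w_2^T \Sigma_x w_2}} \geq \frac{1 - \kappa(\beta_{02})\sqrt{w_1^T \Sigma_y w_1}}{\sqrt{w_1^T \Sigma_x w_1}} .$$

This is obviously contradictory to Eq.(3.50).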
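
Lemma 3.10 can be illustrated numerically in two dimensions. The sketch below is not the solver used in this chapter: it replaces the BMPM optimization by a plain grid search over the one free parameter left after imposing $w^T(\bar{x} - \bar{y}) = 1$, uses the objective that appears in Eq. (3.48), and assumes $\kappa(p) = \sqrt{p/(1-p)}$. The grid range, the function names and the example moments (those of the synthetic data in Section 3.5.2) are choices made here.

```python
import numpy as np

MEAN_X, COV_X = np.array([3.0, 0.0]), np.array([[1.0, 0.0], [0.0, 3.0]])
MEAN_Y, COV_Y = np.array([-3.0, 0.0]), np.array([[3.0, 0.0], [0.0, 1.0]])
T_GRID = np.linspace(-2.0, 2.0, 4001)  # hypothetical search range for the free parameter


def kappa(p):
    """kappa(p) = sqrt(p / (1 - p)), assumed as in the MPM literature."""
    return np.sqrt(p / (1.0 - p))


def bmpm_alpha_2d(beta0, mean_x, cov_x, mean_y, cov_y, t_grid=T_GRID):
    """Grid-search stand-in for BMPM in 2-D: worst-case accuracy alpha for a given beta0."""
    d = mean_x - mean_y
    w0 = d / (d @ d)                                   # particular solution of w^T d = 1
    f = np.array([-d[1], d[0]]) / np.linalg.norm(d)    # direction orthogonal to d
    W = w0 + t_grid[:, None] * f                       # every row satisfies w^T d = 1
    sx = np.sqrt(np.einsum("ij,jk,ik->i", W, cov_x, W))
    sy = np.sqrt(np.einsum("ij,jk,ik->i", W, cov_y, W))
    k_alpha = max(float(np.max((1.0 - kappa(beta0) * sy) / sx)), 0.0)
    return k_alpha**2 / (1.0 + k_alpha**2)


betas = np.linspace(0.1, 0.9, 9)
alphas = np.array([bmpm_alpha_2d(b0, MEAN_X, COV_X, MEAN_Y, COV_Y) for b0 in betas])
for b0, a in zip(betas, alphas):
    print(f"beta0={b0:.1f}  alpha={a:.4f}")
```

On this example the printed $\alpha$ values fall strictly as $\beta_0$ rises, in line with the lemma.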

From the sequential solving method of MEMPM, we know that MEMPM
actually corresponds to a one-dimensional line search problem. More
specifically, it corresponds to maximizing the sum of two functions, namely,
$f_1(\beta) + f_2(\beta)$ $^6$, where $f_1(\beta)$ is determined by the BMPM optimization and
$f_2(\beta) = \beta$. According to Lemma 3.10, $f_1(\beta)$ strictly decreases as $\beta$ increases.
Thus it is strictly pseudo-concave. However, generally speaking, the sum of
a pseudo-concave function and a linear function is not necessarily a
pseudo-concave function, and thus it cannot be assured that every local optimum is the
global optimum. This can be clearly observed in Fig. 3.10. In this figure, $f_1$
is pseudo-concave in three sub-figures; however, the sum $f_1 + f_2$ does not
necessarily lead to a pseudo-concave function.

Nevertheless, there is strong empirical evidence showing that for many
“well-behaved” real world classification problems, $f_1$ is overall concave, which
results in the concavity of $f_1 + f_2$. This is first verified on the datasets used
in this chapter. We shift $\beta$ from 0 to the corresponding upper bound and
plot $\alpha$ against $\beta$ in Fig. 3.11. It is clearly observed that in all six datasets,
in both the kernel and linear cases, the curves of $\alpha$ against $\beta$ are overall
concave. This motivates us to look further into the concavity of MEMPM.
As shown in the following, when two classes of data are “well-separated,” $f_1$
would be concave in the main “interest” region.

We analyze the concavity of $f_1(\beta)$ by imagining that $\beta$ changes from
0 to 1. In this process, the decision hyperplane moves slowly from $\bar{y}$ to $\bar{x}$
according to Eq.(3.46) and Eq.(3.47). At the same time, $\alpha = f_1(\beta)$ should
decrease accordingly. More precisely, if we denote by $d_x$ and $d_y$ respectively
the Mahalanobis distances of $\bar{x}$ and $\bar{y}$ from the associated decision
hyperplane with a specified $\beta$, we can formulate the changes of $\alpha$ and $\beta$ as:

$$\alpha \to \alpha + k_1(d_x)\,\Delta d_x , \qquad \beta \to \beta + k_2(d_y)\,\Delta d_y ,$$

where $k_1(d_x)$ and $k_2(d_y)$ can be considered as the changing rates of $\alpha$ and $\beta$
when the hyperplane lies a distance $d_x$ away from $\bar{x}$ and a distance $d_y$
6 For simplicity, we assume $\theta$ as 0.5. Since a constant does not influence the
concavity analysis, the factor of 0.5 is simply dropped.
(a) Twonorm   (b) Breast   (c) Ionosphere   (d) Pima   (e) Heart-disease   (f) Vote

Fig. 3.11. The curves of $\alpha$ against $\beta$ ($f_1$) are all concave-like in the datasets
used in this chapter
away from $\bar{y}$ respectively. We consider the change of $\alpha$ against $\beta$, namely
$f_1'$:

$$f_1' = \frac{k_1(d_x)\,\Delta d_x}{k_2(d_y)\,\Delta d_y} .$$

If we consider that $d_x$ and $d_y$ change insensitively against each other or
change at a proportional rate, i.e. $\Delta d_x \approx c\,\Delta d_y$ ($c$ is a positive constant)
as the decision hyperplane moves, the above equation can be further written
as

$$f_1' = c\,\frac{k_1(d_x)}{k_2(d_y)} .$$

Lemma 3.11. (1) If $d_y > 1/\sqrt{3}$ or the corresponding $\beta > 0.25$, $k_2(d_y)$
decreases as $d_y$ increases. (2) If $d_x > 1/\sqrt{3}$ or the corresponding $\alpha > 0.25$,
$k_1(d_x)$ decreases as $d_x$ increases.

Proof. Since (1) and (2) are very similar statements, we only prove
(1). $k_2(d)$ is actually the first order derivative of $d^2/(1 + d^2)$ according to
Eq.(3.46). We therefore consider the first order derivative of $k_2(d)$, i.e. the second
order derivative of $d^2/(1 + d^2)$. It is easily verified that $(d^2/(1 + d^2))'' < 0$ when
$d > 1/\sqrt{3}$. This is also illustrated in Fig. 3.12. According to the definition of
the second derivative, we immediately obtain the lemma. Note that $d > 1/\sqrt{3}$
corresponds to $\beta > 0.25$. Thus the condition can also be replaced by $\beta > 0.25$.
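
For completeness, the step described as easily verified in the proof can be written out as follows. This is a direct calculation from Eq. (3.46); the symbol $g$ is introduced here only as shorthand for $d^2/(1 + d^2)$.

```latex
\begin{align*}
g(d)   &= \frac{d^2}{1 + d^2}, \\
g'(d)  &= \frac{2d\,(1 + d^2) - d^2 \cdot 2d}{(1 + d^2)^2} = \frac{2d}{(1 + d^2)^2}, \\
g''(d) &= \frac{2\,(1 + d^2)^2 - 2d \cdot 2\,(1 + d^2) \cdot 2d}{(1 + d^2)^4}
        = \frac{2 - 6d^2}{(1 + d^2)^3}, \\
g''(d) < 0 &\iff d^2 > \tfrac{1}{3} \iff d > \tfrac{1}{\sqrt{3}} \quad (d \ge 0), \\
g\!\left(\tfrac{1}{\sqrt{3}}\right) &= \frac{1/3}{1 + 1/3} = \frac{1}{4}.
\end{align*}
```

The last line is the reason the threshold on $d$ translates into the threshold 0.25 on the worst-case accuracy.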

In the above procedure, $d_y$ and $\beta$ increase while $d_x$ and $\alpha$ decrease as the
hyperplane moves towards $\bar{x}$. Therefore, according to Lemma 3.11, $k_1(d_x)$ increases
while $k_2(d_y)$ decreases when $\alpha, \beta \in [0.25, 1)$. This shows that $f_1'$ is getting
smaller as the hyperplane moves towards $\bar{x}$. In other words, $f_1''$ would be
less than 0 and thus $f_1$ is concave when $\alpha, \beta \in [0.25, 1)$. It should be noted
that in many well-separated real world datasets, the optimal $\alpha$ and $\beta$ will be
greater than 0.25 with high probability, since to achieve good performance,
the worst-case accuracies are naturally required to be greater than some small
amount, e.g. 0.25. This is observed in the datasets used in this chapter. All
the datasets except Pima attain their optimums satisfying this constraint.
For Pima, the overall accuracy is relatively lower, which implies that the two
classes of data in this dataset appear to largely overlap each other $^7$.

An illustration can also be seen in Fig. 3.13. We generate two classes of
Gaussian data with $\bar{x} = [0, 0]^T$, $\bar{y} = [L, 0]^T$, and $\Sigma_x = \Sigma_y = [1, 0; 0, 1]$.
The prior probability for each class is set to an equal value of 0.5. We plot
the curves of $f_1(\beta)$ and $f_1(\beta) + \beta$ when L is set to different values. It is

7 It is observed that, even for Pima, the proposed solving algorithm is still successful,
since $\alpha$ is approximately linear in $\beta$ as shown in Fig. 3.11. Moreover, due to the fact
that the slope of $\alpha$ is slightly greater than $-1$, the final optimum naturally leads $\beta$
to achieve its maximum.
Fig. 3.12. An illustration of the concavity of MEMPM. Subfigure (a)
shows that when the two classes of data largely overlap each other, the optimal
solution of MEMPM lies in the small-value range of $\alpha$ and $\beta$, which is usually
not concave. (b), (c), and (d) show that when the two classes of data are
well-separated, the optimal solutions lie in the region with $\alpha, \beta \in [0.25, 1)$, which
is often concave


observed that when the two classes of data largely overlap each other, for example
in Fig. 3.12(a) with L = 1, the optimal solution of MEMPM lies in the
small-value range of $\alpha$ and $\beta$, which is usually not concave. On the other
hand, Fig. 3.12(b), (c), and (d) show that when the two classes of data are
well-separated, the optimal solutions lie in the region with $\alpha, \beta \in [0.25, 1)$, which
is often concave.
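
For this isotropic example the curve $f_1(\beta)$ can be written in closed form, which makes the behaviour easy to reproduce. With $\Sigma_x = \Sigma_y = I$, the BMPM objective from Eq. (3.48) reduces to $1/\|w\| - \kappa(\beta)$ under the constraint $w^T(\bar{x} - \bar{y}) = 1$, so the optimum has $\|w\| = 1/L$ and $\kappa(\alpha) = L - \kappa(\beta)$. The sketch below uses this reduction, which is derived here rather than taken from the text, together with the assumed relation $\kappa(p) = \sqrt{p/(1-p)}$; the values of L other than 1 are hypothetical, since the text does not list them.

```python
import numpy as np


def kappa(p):
    """kappa(p) = sqrt(p / (1 - p))."""
    return np.sqrt(p / (1.0 - p))


def from_kappa(k):
    """Inverse of kappa: the worst-case accuracy k^2 / (1 + k^2)."""
    return k**2 / (1.0 + k**2)


def f1(beta, L):
    """alpha = f1(beta) for identity covariances and class means a distance L apart."""
    return from_kappa(np.maximum(L - kappa(beta), 0.0))


def line_search(L, n=20001):
    """Grid line search maximizing f1(beta) + beta (theta = 0.5, constant factor dropped)."""
    beta_max = L**2 / (1.0 + L**2)        # beta at which alpha reaches 0
    betas = np.linspace(0.0, beta_max, n)
    alphas = f1(betas, L)
    total = alphas + betas
    i = int(np.argmax(total))
    return float(betas[i]), float(alphas[i]), float(total[i])


for L in (1.0, 2.0, 3.0, 5.0):            # L = 1 is from the text; the others are illustrative
    beta, alpha, total = line_search(L)
    print(f"L={L}: beta={beta:.3f}  alpha={alpha:.3f}  alpha+beta={total:.3f}")
```

For L = 1 the maximum sits at an endpoint, where one of the two worst-case accuracies is zero, whereas for the larger separations the maximum is the symmetric interior point with both accuracies above 0.25.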

Note that, in the above, we make the assumption that as the decision
hyperplane moves, $d_x$ and $d_y$ change at an approximately fixed proportional
Fig. 3.13. The curve of $d^2/(1 + d^2)$. This function is concave when
$d > 1/\sqrt{3}$


rate. From the definition of $d_x$ and $d_y$, this assumption implies that $w$,
the direction of the optimal decision hyperplane, is insensitive to $\beta$. This
assumption does not hold in all cases; however, as observed from the geometrical
interpretation of MEMPM, for data with isotropic or not significantly
anisotropic $\Sigma_x$ and $\Sigma_y$, $w$ would indeed be insensitive to $\beta$.

We summarize the above analysis into the following proposition. 

Proposition 3.12. Assuming (1) two classes of data are well-separated and
(2) $d_x$ and $d_y$ change at an approximately fixed proportional rate as the
optimal decision hyperplane (associated with a specified $\beta$) moves, the
one-dimensional line search problem of MEMPM is often concave in the range of
$\alpha, \beta \in [0.25, 1)$ and will often attain its optimum in this range. Therefore the
proposed solving method leads to a satisfactory solution.

Remarks. As demonstrated above, although MEMPM is often overall
concave in real world tasks, there exist cases in which the MEMPM optimization
problem is not concave. This may lead to the case that the local
optimum found by the SBMPM method is not the global optimum. In these
instances, we may need to choose the initial starting point carefully. Here,
the physical interpretation of $\beta$ as the worst-case accuracy may make it
relatively easy to choose a suitable initial value. For example, we can set the
initial value by using the information obtained from prior domain knowledge.


3.8 Limitations and Future Work 

In this section, we present the limitations and future work. 

First, although MEMPM achieves better performance than MPM, its
sequential optimization of the Biased Minimax Probability Machine may cost
more training time than MPM. In our experiments, MEMPM needs to solve
5 to 15 BMPM optimizations on average. Supposing that BMPM is solved
by Conjugate Gradient methods (with a worst-case time complexity
of the same order as MPM), MEMPM would be 5 to 15 times as expensive as
MPM. Although in pattern recognition tasks, especially in off-line
classification, effectiveness is often more important than efficiency, the expensive
time cost is one of the main limitations of the MEMPM model, in particular
for large scale datasets with millions of samples. To solve this problem, one
possible direction is to remove redundant points which contribute
little to the classification. In this way, the problem dimension
(in the kernelization) would be greatly decreased, which may help
reduce the computational time required. Another possible direction is to
exploit techniques for decomposing the Gram matrix (as is done in SVM)
and to develop specialized optimization procedures for MEMPM.
Undoubtedly, speeding up the algorithm will be a highly worthy topic in the
future.

Second, as a generalized model, MEMPM actually incorporates some
other variations. For example, when the prior probability ($\theta$) cannot be
estimated reliably (e.g. in sparse data), maximizing $\alpha + \beta$, namely the sum of the
accuracies or the difference between true positive and false positive, would
be considered. This type of approach is widely used in the pattern recognition
field, e.g. in medical diagnosis [10] and in graph detection, especially line
detection and arc detection, where it is called Vector Recovery Index [9, 17].
Moreover, when there are domain experts at hand, a variation of MEMPM,
namely the maximization of $C_x \alpha + C_y \beta$, may be used, where $C_x$ ($C_y$) is the
cost of a misclassification of x (y) obtained from experts. Exploring these
variations in some specific domains is thus a valuable direction in the future
(we will discuss these variations as criteria for biased or imbalanced
learning in Chapter 5).

Third, [16] has built up a connection between MPM and SVM from the
perspective of the margin definition, i.e. MPM corresponds to finding the
hyperplane with the maximal margin from the class center. Nevertheless,
some deeper connections need to be investigated, e.g. how is the bound of
MEMPM related to the generalization bound of SVM? More recently, [11] and
also the next chapter have disclosed the relationship between them from
either a local or a global viewpoint of data. It is particularly useful to look into
these links and explore their further connections in the future.


3.9 Summary 

In this chapter, we have proposed a novel global learning model named
Minimum Error Minimax Probability Machine. By minimizing the upper bound of
the Bayes error of future data points, our model derives the distribution-free
Bayes optimal hyperplane in the worst-case setting. This distinguishes
it from the traditional global learning approaches, or more particularly
from traditional Bayes optimal classifiers. More importantly, we have shown
that the worst-case Bayes optimal hyperplane derived by MEMPM becomes
the true Bayes optimal hyperplane when some conditions are satisfied, e.g.
when a Gaussian distribution is assumed on the data. We have shown how
to exploit Mercer kernels in this setting to derive a nonlinear classification
boundary. We have also demonstrated how a robust framework can be
introduced to solidify the foundation of the proposed model. Moreover, we
have demonstrated that this novel model permits an explicit accuracy bound
on future data theoretically, and have validated this proposition empirically as well.
We have evaluated our algorithms on both synthetic datasets and real-world
benchmark datasets. MEMPM is demonstrated to outperform MPM, a model
comparable to SVM.
4 Learning Locally and Globally: Maxi-Min Margin Machine


The proposed MEMPM model obtains the decision hyperplane by using only
global information, e.g. the mean and covariance matrices. However, although
these moments can be estimated more reliably than the distribution itself,
they may still be inaccurate in many cases, e.g. when the data are very
sparse.

Recently, local learning methods, especially large margin classifiers [19],
have attracted much interest in the machine learning and pattern
recognition communities. Support Vector Machine (SVM) [25], the most famous
of them, represents a state-of-the-art classifier. The essential point of SVM
is to find a linear separating hyperplane which achieves the maximal margin
among different classes of data. Furthermore, one can extend SVM to
build nonlinear separating decision hyperplanes by exploiting kernelization
techniques.

These methods do not try to summarize any global information beforehand,
but focus on obtaining the decision hyperplane in a “local” way. For
example, in SVM the decision boundary is exclusively determined by some
critical points which are called support vectors, whereas all other points are
totally irrelevant to this hyperplane. Although this scheme is both
theoretically and empirically demonstrated to be powerful, it actually discards the
global information of data.

An illustrative example can be seen in Fig. 4.1. In this figure, the
classification boundary is intuitively observed to be mainly determined by the
dotted axis, i.e. the long axis of the y data (represented by □'s) or the short
axis of the x data (represented by o's). Moreover, along this axis, the y data
are more likely to scatter than the x data, since y has a relatively
larger variance in this direction. Noting this “global” fact, it seems reasonable
for a good decision hyperplane to lie closer to the x side (see the dash-dot line).
However, SVM ignores this kind of “global” information, i.e. the statistical
trend of data occurrence: the derived SVM decision hyperplane (the solid
line) lies unbiasedly right in the middle of two “local” points (the support
vectors) $^1$.



Fig. 4.1. A decision hyperplane with considerations of both local 
and global information 


Aiming to construct classifiers both locally and globally, we propose the
Maxi-Min Margin Machine ($M^4$) in this chapter. We will attempt to combine
local learning with global information, i.e. the covariance information,
which can represent the data trend. Moreover, as this model also has
the properties of local learning, it will naturally neutralize the impact when
the global information is inaccurate.

As we show later, one critical contribution of this novel model is that
$M^4$ actually presents a unified model of SVM and another recently-proposed
promising model, Minimax Probability Machine (MPM) [11]. Moreover, based
on our proposed local and global view of data, another popular model, Fisher
Discriminant Analysis (FDA) [4], can also be interpreted as its special case.

Another good feature of the $M^4$ model is that it can be cast as a
sequential Conic Programming problem [17], or more specifically, a sequential
Second Order Cone Programming (SOCP) problem [12, 15, 10], which thus
can be practically solved in polynomial time. In addition, by incorporating
the global information, a reduction method is proposed for decreasing the
computation time of this new model.

The third important feature of our proposed model is that the kernelization
methodology is also applicable to this formulation. This generalizes
the linear $M^4$ to a more powerful classification approach which can derive
nonlinear decision boundaries.

The rest of this chapter is organized as follows. In the next section, we
introduce the $M^4$ model in detail, including its model definition, the geometrical
interpretation, connections with other models, and the associated solving
methods. In Section 4.2, we derive a generalization bound for the $M^4$ model. In
Section 4.3, we develop a reduction method to remove redundant points and
thereby decrease the computational time. In Section 4.4, we exploit the
kernelization trick to extend $M^4$ to nonlinear classification tasks. In Section 4.5, we
evaluate this novel model on both synthetic datasets and real world benchmark
datasets. In Section 4.6, we discuss the $M^4$ model and also
present future work. Finally, we conclude this chapter in Section 4.7. A short
version of this work can also be found in [5] [7].

1 This figure has appeared earlier in Chapter 2. However, to keep each chapter
self-contained, we present it again here.


4.1 Maxi-Min Margin Machine 

In the following, for the purpose of clarity, we first divide $M^4$ into
separable and nonseparable categories, and then introduce the corresponding
hard-margin $M^4$ and soft-margin $M^4$ in sequence. In this section, we will also
establish the connections of the $M^4$ model with other large margin
classifiers including SVM, MPM, FDA and Minimum Error Minimax Probability
Machine (MEMPM) [6].

4.1.1 Separable Case 

Assuming the classification samples are separable, we first introduce the 
model definition and the geometrical interpretation. We then transform the 
model optimization problem into a sequential SOCP problem and discuss the 
detailed optimization method. 


4.1.1.1 Problem Definition 

Only two-category classification tasks are considered in this chapter. Let a 
training dataset contain two classes of samples represented by $\mathbf{x}_i \in \mathbb{R}^n$ and 
$\mathbf{y}_j \in \mathbb{R}^n$ respectively, where $i = 1, 2, \ldots, N_x$ and $j = 1, 2, \ldots, N_y$. The basic task 
here can be informally described as finding a suitable hyperplane $f(\mathbf{z}) = \mathbf{w}^T \mathbf{z} + b$ 
separating the two classes of data as robustly as possible ($\mathbf{w} \in \mathbb{R}^n \setminus \{\mathbf{0}\}$, $b \in \mathbb{R}$, 
and $\mathbf{w}^T$ is the transpose of $\mathbf{w}$). Future data points $\mathbf{z}$ for which $f(\mathbf{z}) > 0$ are 
then classified as the class x; otherwise, they are classified as the class y. 

The formulation for M^4 can be written as: 

$$\max_{\rho,\, \mathbf{w} \neq \mathbf{0},\, b} \ \rho \tag{4.1}$$

$$\text{s.t.} \quad \frac{\mathbf{w}^T \mathbf{x}_i + b}{\sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}}} \geq \rho, \quad i = 1, 2, \ldots, N_x, \tag{4.2}$$

$$\frac{-(\mathbf{w}^T \mathbf{y}_j + b)}{\sqrt{\mathbf{w}^T \Sigma_y \mathbf{w}}} \geq \rho, \quad j = 1, 2, \ldots, N_y, \tag{4.3}$$

where $\Sigma_x$ and $\Sigma_y$ refer to the covariance matrices of the x and the y data, 
respectively. 

This model tries to maximize the margin, defined as the minimum Mahalanobis 
distance over all training samples, while simultaneously classifying all 
the data correctly. Compared to SVM, M^4 incorporates the data information 
in a global way; namely, the covariance information of data or the statistical 
trend of data occurrence is considered, while SVMs, including $l_1$-SVM [27] 
and $l_2$-SVM [24] ($l_p$-SVM means the “p-norm” distance-based SVM) [19], 
simply discard this information or consider the same covariance for each 
class. 
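
To make the margin definition concrete, the following minimal sketch evaluates the left-hand sides of Eqs.(4.2) and (4.3) for a given hyperplane and returns the smallest value. The toy points, the covariance matrices and the helper names (`quad_form`, `m4_margin`) are assumptions chosen only for illustration; they are not taken from the experiments of this chapter.

```python
import math

def quad_form(w, S):
    """Return w^T S w for a vector w and a square matrix S (nested lists)."""
    n = len(w)
    return sum(w[a] * S[a][b] * w[b] for a in range(n) for b in range(n))

def m4_margin(w, b, xs, ys, Sx, Sy):
    """Smallest value of the left-hand sides of Eqs.(4.2) and (4.3).

    A positive return value means every training point lies on the
    correct side of the hyperplane w^T z + b = 0.
    """
    sx = math.sqrt(quad_form(w, Sx))
    sy = math.sqrt(quad_form(w, Sy))
    dot = lambda u, v: sum(p * q for p, q in zip(u, v))
    margin_x = min((dot(w, x) + b) / sx for x in xs)
    margin_y = min(-(dot(w, y) + b) / sy for y in ys)
    return min(margin_x, margin_y)

# Hypothetical toy data: class x scatters widely along the first axis,
# class y is compact.
xs = [(2.0, 0.0), (4.0, 1.0)]
ys = [(-2.0, 0.0), (-3.0, -1.0)]
Sx = [[4.0, 0.0], [0.0, 1.0]]
Sy = [[1.0, 0.0], [0.0, 1.0]]

w = (1.0, 0.0)
margin_mid = m4_margin(w, 0.0, xs, ys, Sx, Sy)          # hyperplane z1 = 0
margin_shift = m4_margin(w, 2.0 / 3.0, xs, ys, Sx, Sy)  # hyperplane z1 = -2/3
print(margin_mid, margin_shift)
```

In this toy setting the hyperplane placed midway between the two closest points gives a smaller Mahalanobis margin than the one shifted towards the compact class, which is the behaviour the model is designed to reward.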

4.1.1.2 Geometrical Interpretation 

A geometrical interpretation of M^4 can be seen in Fig. 4.2. In this figure, the 
x data are represented by the inner ellipsoid on the left side with its center 
at $\mathbf{x}_0$, while the y data are represented by the inner ellipsoid on the right 
side with its center at $\mathbf{y}_0$. These two ellipsoids have 
unequal covariances, i.e. unequal risks of data occurrence. However, SVM does not 
consider this global information: its decision hyperplane (the dotted line) is 
located exactly in the middle of the two support vectors (filled points). In 
comparison, M^4 defines the margin as a Maxi-Min Mahalanobis distance, 
and thus constructs a decision plane (the solid line) that takes 
both the local and the global information into account: the M^4 hyperplane corresponds to 
the tangent line of two dashed ellipsoids centered at the support vectors (the 
local information) and shaped by the corresponding covariances (the global 
information). 

Fig. 4.2. A geometric interpretation of M^4. The M^4 hyperplane corresponds 
to the tangent line (the solid line) of two small dashed ellipsoids 
centered at the support vectors (the local information) and shaped by the 
corresponding covariances (the global information). It is thus more reasonable 
than SVM (the dotted line). 

4.1.1.3 Optimization Method 

In the following, we propose the optimization method for the M^4 model. We 
will demonstrate that the above problem can be cast as a sequential Conic 
Programming problem, or more specifically, a sequential SOCP problem. 

Our strategy is based on the “Divide and Conquer” technique. One may 
note that in the optimization problem of M^4, if $\rho$ is fixed to a constant $\rho_n$, the 
problem reduces to “conquering” the problem of checking whether 
the constraints of Eqs.(4.2) and (4.3) can be satisfied. Moreover, as will be 
demonstrated shortly, this “checking” procedure can be stated as an SOCP 
problem. The remaining question is how to set $\rho$, which we 
handle by “dividing”: if the constraints are satisfied, we increase $\rho_n$ 
accordingly; otherwise, we decrease $\rho_n$. 

We detail this solving technique in the following two steps: 

(1) Divide: Set $\rho_n = (\rho_0 + \rho_m)/2$, where $\rho_0$ is a feasible $\rho$, $\rho_m$ is an infeasible 
$\rho$, and $\rho_0 < \rho_m$. 

(2) Conquer: Call the Modified Second Order Cone Programming (MSOCP) 
procedure elaborated in the following to check whether $\rho_n$ is a feasible $\rho$. 
If yes, set $\rho_0 = \rho_n$; otherwise, set $\rho_m = \rho_n$. 

In the above, if a $\rho$ satisfies the constraints of Eqs.(4.2) and (4.3), we call it 
a feasible $\rho$; otherwise, we call it an infeasible $\rho$. These two steps are iterated 
until $|\rho_0 - \rho_m|$ is less than a small positive value. 
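
The two steps above can be sketched as a plain bisection loop. In this sketch the MSOCP check is replaced by a callback `is_feasible`, which is an assumption made for illustration: in a real implementation it would call an SOCP solver, whereas here a hypothetical oracle with a known threshold is used so that the loop can be run on its own.

```python
def bisect_margin(is_feasible, rho_lo, rho_hi, eps=1e-6):
    """Bisection on rho.

    is_feasible(rho) stands in for the MSOCP check: it should return True
    when some (w, b) satisfies Eqs.(4.2) and (4.3) for that rho.
    rho_lo must be a feasible value and rho_hi an infeasible one.
    """
    iterations = 0
    while rho_hi - rho_lo > eps:
        rho_n = 0.5 * (rho_lo + rho_hi)   # Divide
        if is_feasible(rho_n):            # Conquer
            rho_lo = rho_n
        else:
            rho_hi = rho_n
        iterations += 1
    return rho_lo, iterations

# Hypothetical oracle: pretend that every rho up to 1.7 is feasible.
rho_star, n_iter = bisect_margin(lambda rho: rho <= 1.7, 0.0, 8.0, eps=1e-3)
print(rho_star, n_iter)
```

Each pass halves the interval between the feasible and the infeasible value, so the number of calls to the checking procedure grows only logarithmically with the ratio between the initial interval length and the tolerance.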

We propose the following Theorem 4.1 showing that the MSOCP procedure, 
namely, the checking problem with $\rho$ fixed to a constant $\rho_n$, is solvable 
by casting it as an SOCP problem. 

Theorem 4.1. The problem of checking whether there exist a $\mathbf{w}$ and a $b$ 
satisfying the following two sets of constraints Eqs.(4.4) and (4.5) can be 
transformed into an SOCP problem which can be solved in polynomial time, 

$$(\mathbf{w}^T \mathbf{x}_i + b) \geq \rho_n \sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}}, \quad i = 1, \ldots, N_x, \tag{4.4}$$

$$-(\mathbf{w}^T \mathbf{y}_j + b) \geq \rho_n \sqrt{\mathbf{w}^T \Sigma_y \mathbf{w}}, \quad j = 1, \ldots, N_y. \tag{4.5}$$

Proof. Introducing dummy variables $\tau$, we rewrite the above checking problem 
as an equivalent optimization problem: 

$$\max_{\mathbf{w} \neq \mathbf{0},\, b,\, \tau} \ \Big\{ \min_{k = 1, \ldots, N_x + N_y} \tau_k \Big\}$$

$$\text{s.t.} \quad (\mathbf{w}^T \mathbf{x}_i + b) \geq \rho_n \sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}} - \tau_i,$$

$$-(\mathbf{w}^T \mathbf{y}_j + b) \geq \rho_n \sqrt{\mathbf{w}^T \Sigma_y \mathbf{w}} - \tau_{j+N_x},$$

where $i = 1, \ldots, N_x$ and $j = 1, \ldots, N_y$. 

By checking whether the minimum $\tau_k$ at the optimum point is positive, 
we can know whether the constraints of Eqs.(4.2) and (4.3) can be satisfied. 
Going one step further, we can introduce another dummy variable $\eta$ and transform 
the above problem into an SOCP problem: 

$$\max_{\mathbf{w} \neq \mathbf{0},\, b,\, \tau,\, \eta} \ \eta$$

$$\text{s.t.} \quad (\mathbf{w}^T \mathbf{x}_i + b) \geq \rho_n \sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}} - \tau_i,$$

$$-(\mathbf{w}^T \mathbf{y}_j + b) \geq \rho_n \sqrt{\mathbf{w}^T \Sigma_y \mathbf{w}} - \tau_{j+N_x},$$

$$\eta \leq \tau_k,$$

where $i = 1, \ldots, N_x$, $j = 1, \ldots, N_y$, and $k = 1, \ldots, N_x + N_y$. By checking 
whether the optimal $\eta$ is greater than 0, we can immediately know whether 
there exist a $\mathbf{w}$ and a $b$ satisfying the constraints of Eqs.(4.2) and (4.3). 
Moreover, the above optimization is easily verified to be in the standard SOCP 
form, since the objective function is linear and the constraints are 
either linear or typical second order conic constraints. 

Remarks. In practice, many SOCP programs, e.g. Sedumi [20], provide 
schemes to directly handle the above checking procedure. There is thus no need to 
introduce dummy variables as we have done in the proof. 

We now analyze the time complexity of M^4. As indicated in [12], if the 
SOCP is solved based on interior-point methods, it has a worst-case 
complexity of $O(n^3)$. If we denote the range of feasible $\rho$'s as $L = \rho_{\max} - \rho_{\min}$ 
and the required precision as $\varepsilon$, then the number of iterations for M^4 is 
$\log(L/\varepsilon)$ in the worst case. Adding the cost of forming the system matrix 
(constraint matrix), which is $O(Nn^3)$ ($N$ represents the number of training 
points), the total complexity would be $O(\log(L/\varepsilon)\, n^3 + N n^3) \approx O(N n^3)$, which 
is relatively large but can still be solved in polynomial time.² 

4.1.2 Connections with Other Models 

In this section, we establish connections between M 4 and other models. We 
show that SVM and MPM are actually special cases of our model. Moreover, 
FDA can be interpreted and extended according to our local and global views 
of data. 

4.1.2.1 Connection with Minimax Probability Machine 

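If one expands the constraints of Eq.(4.2) and adds all of them together, one 
can immediately obtain the following: 

$$\sum_{i=1}^{N_x} \mathbf{w}^T \mathbf{x}_i + N_x b \geq N_x \rho \sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}} \ \Rightarrow \ \mathbf{w}^T \bar{\mathbf{x}} + b \geq \rho \sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}}, \tag{4.6}$$

where $\bar{\mathbf{x}}$ denotes the mean of the x training data. 

Similarly, from Eq.(4.3) one can obtain: 

$$-\Big( \sum_{j=1}^{N_y} \mathbf{w}^T \mathbf{y}_j + N_y b \Big) \geq N_y \rho \sqrt{\mathbf{w}^T \Sigma_y \mathbf{w}} \ \Rightarrow \ -(\mathbf{w}^T \bar{\mathbf{y}} + b) \geq \rho \sqrt{\mathbf{w}^T \Sigma_y \mathbf{w}}, \tag{4.7}$$

where $\bar{\mathbf{y}}$ denotes the mean of the y training data. 
Adding Eqs.(4.6) and (4.7), one can obtain: 

$$\max_{\rho,\, \mathbf{w}} \ \rho \quad \text{s.t.} \quad \mathbf{w}^T (\bar{\mathbf{x}} - \bar{\mathbf{y}}) \geq \rho \Big( \sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}} + \sqrt{\mathbf{w}^T \Sigma_y \mathbf{w}} \Big). \tag{4.8}$$

The above optimization is exactly the MPM optimization [11]. Note, however, 
that the above procedure cannot be reversed. This means that MPM is 
a special case of M^4. 

2 Note that the system matrix needs to be formed only once. 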
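
The implication used in this derivation can be checked numerically. The sketch below, written under the assumption of hypothetical toy data and helper names (`mpm_bound`, `m4_margin`), compares the margin attained under the pointwise constraints of Eqs.(4.2) and (4.3) with the largest $\rho$ that the averaged constraint of Eq.(4.8) would allow for the same direction $\mathbf{w}$.

```python
import math

def quad_form(w, S):
    """Return w^T S w for a vector w and a square matrix S (nested lists)."""
    n = len(w)
    return sum(w[a] * S[a][b] * w[b] for a in range(n) for b in range(n))

def mean(points):
    """Coordinate-wise mean of a list of equally sized tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def mpm_bound(w, xs, ys, Sx, Sy):
    """Largest rho permitted by Eq.(4.8) for a fixed direction w."""
    mx, my = mean(xs), mean(ys)
    gap = sum(wi * (a - c) for wi, a, c in zip(w, mx, my))
    return gap / (math.sqrt(quad_form(w, Sx)) + math.sqrt(quad_form(w, Sy)))

def m4_margin(w, b, xs, ys, Sx, Sy):
    """Largest rho permitted by Eqs.(4.2) and (4.3) for fixed w and b."""
    sx, sy = math.sqrt(quad_form(w, Sx)), math.sqrt(quad_form(w, Sy))
    dot = lambda u, v: sum(p * q for p, q in zip(u, v))
    return min(min((dot(w, x) + b) / sx for x in xs),
               min(-(dot(w, y) + b) / sy for y in ys))

# Hypothetical toy data and covariance matrices
xs = [(2.0, 0.0), (4.0, 1.0)]
ys = [(-2.0, 0.0), (-3.0, -1.0)]
Sx = [[4.0, 0.0], [0.0, 1.0]]
Sy = [[1.0, 0.0], [0.0, 1.0]]
w = (1.0, 0.0)

rho_global = mpm_bound(w, xs, ys, Sx, Sy)
rho_local = m4_margin(w, 2.0 / 3.0, xs, ys, Sx, Sy)
print(rho_local, rho_global)
```

Because Eq.(4.8) is obtained by averaging the pointwise constraints, any $\rho$ feasible for the pointwise constraints is also feasible for the averaged one, so the first printed value cannot exceed the second.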

Remarks. In MPM, since the decision is completely determined by the global 
information, namely, the mean and covariance matrices [11], the estimates of 
the mean and covariance matrices need to be reliable to assure accurate 
performance. However, this is not always the case in real-world tasks. On 
the other hand, M^4 seems to solve this problem in a natural way, because 
the impact caused by inaccurately estimated mean and covariance matrices 
can be neutralized by utilizing the local information, namely by satisfying 
the constraints of Eqs.(4.2) and (4.3) for each local data point. This is also 
demonstrated in the later experiments. 


4.1.2.2 Connection with Support Vector Machine 

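If one assumes $\Sigma_x = \Sigma_y = \Sigma$, the optimization of M^4 can be changed to: 

$$\max_{\rho,\, \mathbf{w} \neq \mathbf{0},\, b} \ \rho$$

$$\text{s.t.} \quad (\mathbf{w}^T \mathbf{x}_i + b) \geq \rho \sqrt{\mathbf{w}^T \Sigma \mathbf{w}},$$

$$-(\mathbf{w}^T \mathbf{y}_j + b) \geq \rho \sqrt{\mathbf{w}^T \Sigma \mathbf{w}},$$

where $i = 1, \ldots, N_x$ and $j = 1, \ldots, N_y$. 

Observing that the magnitude of $\mathbf{w}$ will not influence the optimization, 
without loss of generality, one can further assume $\rho \sqrt{\mathbf{w}^T \Sigma \mathbf{w}} = 1$. Therefore 
the optimization can be changed to: 

$$\min_{\mathbf{w} \neq \mathbf{0},\, b} \ \mathbf{w}^T \Sigma \mathbf{w} \tag{4.9}$$

$$\text{s.t.} \quad (\mathbf{w}^T \mathbf{x}_i + b) \geq 1, \tag{4.10}$$

$$-(\mathbf{w}^T \mathbf{y}_j + b) \geq 1, \tag{4.11}$$

where $i = 1, \ldots, N_x$ and $j = 1, \ldots, N_y$. 

A special case of the above with $\Sigma = I$ is precisely the optimization of 
SVM, where $I$ is the identity matrix. 

3 This can be directly observed from Eq.(4.8). 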

Remarks. In the above, two assumptions are implicitly made by SVM: One 
is the assumption on data “orientation” or data shape, i.e. $\Sigma_x = \Sigma_y = \Sigma$, 
and the other is the assumption on data “scattering magnitude” or data 
compactness, i.e. $\Sigma = I$. However, these two assumptions are inappropriate. 
We demonstrate this in Figs. 4.3 and 4.4. We assume the orientation and 
the magnitude of each ellipsoid represent the data shape and compactness, 
respectively, in these figures. 



Fig. 4.3. An illustration that SVM omits the data compactness 
information. 

Fig. 4.3 plots two types of data with the same data orientations but different 
data scattering magnitudes. By ignoring data scattering, 
SVM improperly locates itself exactly in the middle of the support vectors 
(filled points), even though x is more likely to scatter along the horizontal axis. 
Instead, M^4 is more reasonable (see the solid line in this figure). Furthermore, 
Fig. 4.4 plots the case with the same data scattering magnitudes but different 
data orientations. Similarly, SVM does not capture the orientation information. 
In comparison, M^4 grasps this information and produces a more 
suitable decision plane: M^4 represents the tangent line between two small 
dashed ellipsoids centered at the support vectors (filled points). Note that 
SVM and M^4 do not need to achieve the same support vectors. In Fig. 4.4, 
M^4 contains the above two filled points as support vectors, whereas SVM has 
all the three filled points as support vectors. 

Fig. 4.4. An illustration that SVM discards the data orientation 
information. 


4.1.2.3 Link with Fisher Discriminant Analysis 

FDA, an important and popular method, is used widely in constructing decision 
hyperplanes and reducing the feature dimensionality. In the following 
discussion, we mainly consider its application as a classifier. FDA involves 
solving the following optimization problem: 

$$\max_{\mathbf{w} \neq \mathbf{0}} \ \frac{|\mathbf{w}^T (\bar{\mathbf{x}} - \bar{\mathbf{y}})|}{\sqrt{\mathbf{w}^T \Sigma_x \mathbf{w} + \mathbf{w}^T \Sigma_y \mathbf{w}}}.$$

Similar to MPM, FDA also focuses on using the global information rather 
than considering data both locally and globally. We now show that FDA can 
be modified to consider data both locally and globally. 

If one changes the denominators in Eqs.(4.2) and (4.3) to $\sqrt{\mathbf{w}^T \Sigma_x \mathbf{w} + \mathbf{w}^T \Sigma_y \mathbf{w}}$, 
the optimization can be changed to: 

$$\max_{\rho,\, \mathbf{w} \neq \mathbf{0},\, b} \ \rho \tag{4.12}$$

$$\text{s.t.} \quad \frac{\mathbf{w}^T \mathbf{x}_i + b}{\sqrt{\mathbf{w}^T \Sigma_x \mathbf{w} + \mathbf{w}^T \Sigma_y \mathbf{w}}} \geq \rho, \tag{4.13}$$

$$\frac{-(\mathbf{w}^T \mathbf{y}_j + b)}{\sqrt{\mathbf{w}^T \Sigma_x \mathbf{w} + \mathbf{w}^T \Sigma_y \mathbf{w}}} \geq \rho, \tag{4.14}$$

where $i = 1, \ldots, N_x$ and $j = 1, \ldots, N_y$. The above optimization is actually 
a generalized case of FDA, which considers data locally and globally. This is 
verified as follows. 

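If one performs a procedure similar to that of Section 4.1.2.1, the above 
optimization problem is easily verified to lead to the following optimization: 

$$\max_{\rho,\, \mathbf{w} \neq \mathbf{0},\, b} \ \rho \quad \text{s.t.} \quad \mathbf{w}^T (\bar{\mathbf{x}} - \bar{\mathbf{y}}) \geq \rho \sqrt{\mathbf{w}^T \Sigma_x \mathbf{w} + \mathbf{w}^T \Sigma_y \mathbf{w}}. \tag{4.15}$$

One can change Eq.(4.15) to 

$$\rho \leq \frac{|\mathbf{w}^T (\bar{\mathbf{x}} - \bar{\mathbf{y}})|}{\sqrt{\mathbf{w}^T \Sigma_x \mathbf{w} + \mathbf{w}^T \Sigma_y \mathbf{w}}},$$

which is exactly the optimization of the FDA ($\mathbf{w}^T (\bar{\mathbf{x}} - \bar{\mathbf{y}})$ is implicitly 
implied to be a positive value by Eqs.(4.13) and (4.14)). 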

Remarks. The extended FDA optimization focuses on the data orientation, 
while omitting the data scattering magnitude information. 
An analysis similar to that of Section 4.1.2.2 shows this: 
its decision hyperplane in the example of Fig. 4.3 coincides with that of 
SVM. With respect to the data orientation, it actually uses the average of 
the covariances of the two types of data. As illustrated in Fig. 4.5, the extended 
FDA corresponds to the line lying exactly in the middle of the long axes of 
the x and y data. This shows that the extended FDA considers the data 
orientation partially yet incompletely. 


4.1.3 Nonseparable Case 


In this section, we modify the M^4 model to handle the nonseparable case. 
We need to introduce slack variables in this case. The optimization of M^4 is 
changed to: 

$$\max_{\rho,\, \mathbf{w} \neq \mathbf{0},\, b,\, \xi} \ \rho - C \sum_{k=1}^{N_x + N_y} \xi_k \tag{4.16}$$

$$\text{s.t.} \quad (\mathbf{w}^T \mathbf{x}_i + b) \geq \rho \sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}} - \xi_i, \tag{4.17}$$

$$-(\mathbf{w}^T \mathbf{y}_j + b) \geq \rho \sqrt{\mathbf{w}^T \Sigma_y \mathbf{w}} - \xi_{j+N_x}, \tag{4.18}$$

$$\xi_k \geq 0,$$

where $i = 1, \ldots, N_x$, $j = 1, \ldots, N_y$, and $k = 1, \ldots, N_x + N_y$. $C$ is the positive 
penalty parameter and $\xi_k$ is the slack variable, which can be considered as 
the extent to which the training point $\mathbf{z}_k$ violates the $\rho$ margin ($\mathbf{z}_k = \mathbf{x}_k$ when 
$1 \leq k \leq N_x$; $\mathbf{z}_k = \mathbf{y}_{k-N_x}$ when $N_x + 1 \leq k \leq N_x + N_y$). Thus $\sum_{k=1}^{N_x+N_y} \xi_k$ 
can be conceptually regarded as the training error or the empirical error. 
In other words, the above optimization maximizes the minimum 
margin while minimizing the total training error. 

Fig. 4.5. An illustration that FDA partly yet incompletely 
considers the data orientation. 

4.1.3.1 Solving Method 

As can be clearly observed, when $\rho$ is fixed, the optimization is equivalent to 
minimizing $\sum_{k=1}^{N_x+N_y} \xi_k$ under the same constraints. This is once again an SOCP 
problem and thus can be solved in polynomial time. We can then update $\rho$ 
according to some rules and repeat the whole process until an optimal $\rho$ is 
found. This is once again the so-called line search problem. We still adopt the 
Quadratic Interpolation method to solve this problem, which converges 
superlinearly to the global optimum if suitable starting points are assigned [1]. 
Since we have introduced this line search method in Chapter 3, we simply 
omit it here. 

In summary, we iterate the following two steps to solve the modified 
optimization. 

Step 1. Generate a new $\rho_n$ from three previous $\rho_1, \rho_2, \rho_3$ by using the 
Quadratic Interpolation method. 

Step 2. Fix $\rho = \rho_n$, perform the optimization based on SOCP algorithms. 
Update $\rho_1, \rho_2, \rho_3$. 
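
Step 1 can be illustrated with the textbook three-point parabola formula. This is a minimal sketch under stated assumptions: the details of the Quadratic Interpolation method used in this book are given in Chapter 3 and are not reproduced here, and the function `objective` below is a hypothetical stand-in for the optimal value returned by the SOCP solved at a fixed $\rho$.

```python
def quadratic_interpolation_step(r1, r2, r3, f1, f2, f3):
    """Return the stationary point of the parabola through three samples.

    (r1, f1), (r2, f2), (r3, f3) are three values of rho together with
    the objective values observed there.
    """
    num = (r2 - r1) ** 2 * (f2 - f3) - (r2 - r3) ** 2 * (f2 - f1)
    den = (r2 - r1) * (f2 - f3) - (r2 - r3) * (f2 - f1)
    if den == 0.0:
        raise ValueError("the three samples lie on a straight line")
    return r2 - 0.5 * num / den

def objective(rho):
    # Hypothetical concave stand-in for the value obtained in Step 2
    return -(rho - 2.0) ** 2 + 5.0

rho_n = quadratic_interpolation_step(
    0.0, 1.0, 3.0, objective(0.0), objective(1.0), objective(3.0))
print(rho_n)
```

For an exactly quadratic objective a single step lands on the maximizer; for a general objective the step is repeated, replacing one of the three previous points by the new one after each SOCP solve.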
4.1.4 Further Connection with Minimum Error Minimax Probability Machine 


In this section, we show how M^4 can be connected with the Minimum Error 
Minimax Probability Machine [6], which is a worst-case Bayes optimal 
classifier and a superset of MPM as well. 

If one looks carefully into the optimization of nonseparable M^4, a more 
precise form is the one replacing $\xi_i$ with $\xi_i \sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}}$ in Eq.(4.17) and 
$\xi_{j+N_x}$ with $\xi_{j+N_x} \sqrt{\mathbf{w}^T \Sigma_y \mathbf{w}}$ in Eq.(4.18). However, this optimization may prove to be a 
difficult problem. Nevertheless, we can start from this precise form and derive 
the connection of M^4 with MEMPM. 

We reformulate the optimization of Eqs.(4.17) and (4.18) in their precise 
forms as follows: 

$$\max_{\rho,\, \mathbf{w} \neq \mathbf{0},\, b,\, \xi} \ \rho - C \sum_{k=1}^{N_x + N_y} \xi_k \tag{4.19}$$

$$\text{s.t.} \quad \frac{\mathbf{w}^T \mathbf{x}_i + b}{\sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}}} \geq \rho - \xi_i, \tag{4.20}$$

$$-\frac{\mathbf{w}^T \mathbf{y}_j + b}{\sqrt{\mathbf{w}^T \Sigma_y \mathbf{w}}} \geq \rho - \xi_{j+N_x}, \tag{4.21}$$

$$\xi_k \geq 0, \tag{4.22}$$

where $i = 1, \ldots, N_x$, $j = 1, \ldots, N_y$, and $k = 1, \ldots, N_x + N_y$. 

Maximizing Eq.(4.19) has a meaning similar to minimizing 
$B \sum_{k=1}^{N_x+N_y} \xi_k + 1/\rho^2$ ($B$ is a positive parameter), in the sense that they both 
attempt to maximize the margin $\rho$ and minimize the error rate. If we consider 
$\sum_{k=1}^{N_x+N_y} \xi_k$ as the residue and regard $1/\rho^2$ as the regularization term, the 
optimization can be cast into the framework of solving ill-posed problems.⁴ 
According to [24, 26], the above optimization, known as Tikhonov's 
Variation Method [22], is equivalent to the optimization below, referred to as 
Ivanov's Quasi-Solution Method [8], in the sense that if one of the methods 
for a given value of the parameter (say $C$) produces a solution $\{\mathbf{w}, b\}$, then 
the other method can derive the same solution by adapting its corresponding 
parameter (say $A$). 

4 A trick can be made by treating $1/\rho^2$ as a new variable, so that the condition 
that the regularization is convex can be satisfied. 

$$\min_{\rho,\, \mathbf{w} \neq \mathbf{0},\, b,\, \xi} \ \sum_{k=1}^{N_x + N_y} \xi_k \tag{4.23}$$

$$\text{s.t.} \quad \frac{\mathbf{w}^T \mathbf{x}_i + b}{\sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}}} \geq \rho - \xi_i, \tag{4.24}$$

$$-\frac{\mathbf{w}^T \mathbf{y}_j + b}{\sqrt{\mathbf{w}^T \Sigma_y \mathbf{w}}} \geq \rho - \xi_{j+N_x}, \tag{4.25}$$

$$\rho \geq A, \quad \xi_k \geq 0, \tag{4.26}$$

where $A$ is a positive constant parameter. 

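Now if we expand Eq.(4.24) for each $i$ and add them all together, we can 
obtain: 

$$N_x \frac{\mathbf{w}^T \bar{\mathbf{x}} + b}{\sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}}} \geq N_x \rho - \sum_{i=1}^{N_x} \xi_i. \tag{4.27}$$

This equation can easily be changed to: 

$$\sum_{i=1}^{N_x} \xi_i \geq N_x \rho - N_x \frac{\mathbf{w}^T \bar{\mathbf{x}} + b}{\sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}}}. \tag{4.28}$$

Similarly, if we expand Eq.(4.25) for each $j$ and add them all together, we 
obtain: 

$$\sum_{j=1}^{N_y} \xi_{j+N_x} \geq N_y \rho + N_y \frac{\mathbf{w}^T \bar{\mathbf{y}} + b}{\sqrt{\mathbf{w}^T \Sigma_y \mathbf{w}}}. \tag{4.29}$$

By adding Eq.(4.28) and Eq.(4.29), we obtain: 

$$\sum_{k=1}^{N} \xi_k \geq N \rho - \left( N_x \frac{\mathbf{w}^T \bar{\mathbf{x}} + b}{\sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}}} - N_y \frac{\mathbf{w}^T \bar{\mathbf{y}} + b}{\sqrt{\mathbf{w}^T \Sigma_y \mathbf{w}}} \right), \tag{4.30}$$

where $N = N_x + N_y$. 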


To achieve the minimum training error, namely, $\min_{\rho,\, \mathbf{w} \neq \mathbf{0},\, b,\, \xi} \sum_{k=1}^{N_x+N_y} \xi_k$, we 
may consider minimizing its lower bound as specified by the right-hand side 
of Eq.(4.30). Hence in this case $\rho$ should attain its lower bound $A$, while the 
second part should be as large as possible, i.e. 

$$\max_{\mathbf{w} \neq \mathbf{0},\, b} \ \theta \frac{\mathbf{w}^T \bar{\mathbf{x}} + b}{\sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}}} - (1 - \theta) \frac{\mathbf{w}^T \bar{\mathbf{y}} + b}{\sqrt{\mathbf{w}^T \Sigma_y \mathbf{w}}}, \tag{4.31}$$

where $\theta$ is defined as $N_x/N$ and thus $1 - \theta$ denotes $N_y/N$. If one further 
transforms the above to: 

$$\max_{\mathbf{w} \neq \mathbf{0},\, b} \ \{ \theta t + (1 - \theta) s \}, \tag{4.32}$$

$$\text{s.t.} \quad \frac{\mathbf{w}^T \bar{\mathbf{x}} + b}{\sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}}} \geq t, \tag{4.33}$$

$$-\frac{\mathbf{w}^T \bar{\mathbf{y}} + b}{\sqrt{\mathbf{w}^T \Sigma_y \mathbf{w}}} \geq s, \tag{4.34}$$

one can see that the above optimizes a form very similar to the MEMPM 
model, except that the objective Eq.(4.32) changes to [6] 

$$\max_{\mathbf{w} \neq \mathbf{0},\, b} \ \left\{ \theta \frac{t^2}{1 + t^2} + (1 - \theta) \frac{s^2}{1 + s^2} \right\}.$$

In MEMPM, $t^2/(1 + t^2)$ ($s^2/(1 + s^2)$), denoted as $\alpha$ ($\beta$), represents the 
worst-case accuracy for the classification of future x (y) data. Thus MEMPM 
maximizes the weighted accuracy on the future data. In M^4, $s$ and $t$ represent the 
corresponding margin, which is defined as the distance from the hyperplane 
to the class center. Therefore, it represents the weighted maximum margin 
machine in this sense. Moreover, since the function $g(u) = u^2/(1 + u^2)$ 
increases monotonically with $u$ for $u \geq 0$, maximizing the above formulae has a 
physical meaning similar to the optimization of MEMPM in some sense. 
Remarks. Implicit constraints are contained in the optimization of the 
special case of M^4 derived above. Empirically, Eq.(4.27) cannot achieve the 
equality in the normal case, since Eqs.(4.24) and (4.25) can only achieve 
equalities for support vectors. Moreover, the slack variables are usually far 
smaller than $\rho$. This implies we can consider 

$$\frac{\mathbf{w}^T \bar{\mathbf{x}} + b}{\sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}}} > \rho = A.$$

Analogously, for y, a similar statement can be obtained. The presence of 
these two constraints is essential, since with the constraints the parameter $\rho$ 
is involved in the optimization. Moreover, these two constraints also prevent 
the circumstance that the decision hyperplane is extremely far away from one 
class center, while being very close to the other class center. 


4.2 Bound on the Error Rate 

In this section, we provide theoretical results on the bound of the error rate 
of M 4 . We first borrow the leave-one-out theorem from [13] and [25]. 

Lemma 4.2. The leave-one-out estimator is almost unbiased. 

We then present the generalization bound of M^4 as the following theorem: 

Theorem 4.3. If (1) the training set containing $N$ samples is separated by 
the decision hyperplane derived by M^4 and (2) the mean and covariance matrices 
are reliably estimated, then the expectation of the probability of the test 
error is bounded by the expectation of the minimum of two values: the ratio 
$m/N$ and 

$$\theta \frac{1}{1 + d_x^2} + (1 - \theta) \frac{1}{1 + d_y^2},$$

where $m$ is the number of support vectors, $d_x$ and $d_y$ are the corresponding 
Mahalanobis distances from the class centers $\bar{\mathbf{x}}$ and $\bar{\mathbf{y}}$ to the decision 
hyperplane, and $\theta$ is the prior probability of the x data. Namely, 

$$E[P_{error}] \leq E\left[ \min\left\{ \frac{m}{N}, \ \theta \frac{1}{1 + d_x^2} + (1 - \theta) \frac{1}{1 + d_y^2} \right\} \right]. \tag{4.35}$$


Proof. According to Lemma 4.2, to prove $E[P_{error}] \leq E[m/N]$, we only need 
to show that the number of errors made by the leave-one-out method does not 
exceed the number of support vectors. This is indeed the case. If we leave a 
non-support vector out and then perform training on the remaining data, 
the decision hyperplane will not change, since the decision hyperplane is 
decided only by the support vectors and the covariance matrices (statistically, one 
point will not influence the covariance of data). Therefore, this non-support 
vector will be recognized correctly. Thus the leave-one-out method classifies 
correctly all the samples that are not support vectors, i.e. the number of the 
leave-one-out errors does not exceed the number of the support vectors. 

We next prove 

$$E[P_{error}] \leq E\left[ \theta \frac{1}{1 + d_x^2} + (1 - \theta) \frac{1}{1 + d_y^2} \right].$$

According to [11, 6, 14], if the means and covariances are reliably estimated, 
$d_x^2/(1 + d_x^2)$ and $d_y^2/(1 + d_y^2)$ represent the worst-case rates in recognizing 
correctly the x data and the y data respectively. Therefore, 

$$\theta \frac{1}{1 + d_x^2} + (1 - \theta) \frac{1}{1 + d_y^2}$$

represents the expected maximum error rate. Since both bounds hold, the smaller 
of the two also bounds the error, i.e. 

$$E[P_{error}] \leq E\left[ \min\left\{ \frac{m}{N}, \ \theta \frac{1}{1 + d_x^2} + (1 - \theta) \frac{1}{1 + d_y^2} \right\} \right].$$

Remarks. Note that the above two items represent two aspects 
of the M^4 model: minimizing the leave-one-out error reflects the contribution 
of the local information from data, whereas 
the second item describes the effect of the global information 
from data. Moreover, if we further examine the second item, $d_x$ ($d_y$) is 
actually determined by two parts: the Mahalanobis distance from the support 
vectors to the corresponding class center $\bar{\mathbf{x}}$ ($\bar{\mathbf{y}}$) and the margin $\rho$. This can 
be observed in Fig. 4.2. Intuitively, the larger the margin $\rho$ is, the larger $d_x$ 
and $d_y$ are, which leads to a smaller expected test error in the future. This 
motivates the margin maximization in the large margin machines. 
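
The quantity inside the expectation of Eq.(4.35) is simple to evaluate for a single trained classifier. The sketch below does so for two hypothetical configurations; the function names and all numerical values are assumptions for illustration only and do not come from the experiments.

```python
def worst_case_accuracy(d):
    """d^2 / (1 + d^2), the worst-case accuracy at Mahalanobis distance d."""
    return d * d / (1.0 + d * d)

def m4_error_bound(m, N, theta, dx, dy):
    """Minimum of the local term m/N and the global term of Eq.(4.35)."""
    local_term = m / N
    global_term = theta / (1.0 + dx * dx) + (1.0 - theta) / (1.0 + dy * dy)
    return min(local_term, global_term)

# Hypothetical values: 10 support vectors out of 100 training points,
# equal class priors, and two different distances to the class centers.
bound_small_margin = m4_error_bound(m=10, N=100, theta=0.5, dx=1.0, dy=1.0)
bound_large_margin = m4_error_bound(m=10, N=100, theta=0.5, dx=4.0, dy=4.0)
print(bound_small_margin, bound_large_margin)
```

With small distances the local term $m/N$ is the tighter of the two, whereas with larger distances the global term takes over, which matches the intuition that a larger margin lowers the expected test error.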
4.3 Reduction 


The variables in the previous sections are $[\mathbf{w}, b, \xi_1, \ldots, \xi_{N_x}, \ldots, \xi_{N_x+N_y}]$, whose 
dimension is $n + 1 + N_x + N_y$. The number of second order conic 
constraints is easily verified to be $N_x + N_y$. The generated constraint 
matrix is therefore large, which may cause problems in solving 
large scale classification tasks. Therefore, we should reduce both the number 
of constraints and the number of variables. 

Since this problem is caused by the number of data points, we consider 
removing some redundant points to reduce both the space and time 
complexity. The reduction rule is introduced as follows. 


Reduction Rule: Set a threshold $\nu \in [0, 1)$. In each class, calculate the 
Mahalanobis distance $d_i$ of each point to its corresponding class center; if 
$d_i^2/(1 + d_i^2)$, denoted as $\nu_i$, is greater than $\nu$, namely, $\nu_i > \nu$, keep this point; 
otherwise, remove this point. 

The intuition behind this rule is that, in general, the more discriminant 
information a point contains, the further it is from its center (unless it is a 
noise point). The justification of this rule comes from [11]: $d^2/(1 + d^2)$ is 
the worst-case classification accuracy for future data, where $d$ is the minimax 
Mahalanobis distance from the class center to the decision hyperplane. Thus 
removing those points with small $\nu_i$'s, namely, small $d_i^2/(1 + d_i^2)$, will not affect 
the worst-case classification accuracy and will not greatly reduce the overall 
performance. 
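
The reduction rule can be sketched for two-dimensional data as follows. The class center and covariance matrix are passed in directly and the example values are hypothetical; in practice they would be estimated from the training data of each class. The closed-form inverse restricts this sketch to 2 x 2 covariance matrices.

```python
def mahalanobis_sq(p, center, S):
    """Squared Mahalanobis distance in 2-D, using the closed-form 2x2 inverse."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    inv = [[S[1][1] / det, -S[0][1] / det],
           [-S[1][0] / det, S[0][0] / det]]
    dx, dy = p[0] - center[0], p[1] - center[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

def reduce_points(points, center, S, nu):
    """Keep the points whose nu_i = d_i^2 / (1 + d_i^2) exceeds the threshold nu."""
    kept = []
    for p in points:
        d2 = mahalanobis_sq(p, center, S)
        if d2 / (1.0 + d2) > nu:
            kept.append(p)
    return kept

# Hypothetical class with center at the origin and larger variance
# along the second axis.
points = [(0.1, 0.0), (2.0, 0.0), (0.0, 3.0), (0.5, 0.5)]
center = (0.0, 0.0)
S = [[1.0, 0.0], [0.0, 4.0]]
kept = reduce_points(points, center, S, nu=0.5)
print(kept)
```

Only the two points far from the center in the Mahalanobis sense survive the threshold 0.5, so both the number of conic constraints and the number of slack variables contributed by this class drop from four to two.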

Nevertheless, to cancel the negative impact caused by removing those 
points, we add the following global constraint: 

$$\mathbf{w}^T (\bar{\mathbf{x}} - \bar{\mathbf{y}}) \geq \rho \Big( \sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}} + \sqrt{\mathbf{w}^T \Sigma_y \mathbf{w}} \Big). \tag{4.36}$$

Integrating the above, we formulate the modified model as follows: 

$$\max_{\rho,\, \mathbf{w} \neq \mathbf{0},\, b,\, \xi} \ \rho - C \left( \sum_{k=1}^{r_x + r_y} \xi_k + (N_x + N_y - r_x - r_y)\, \xi_m \right)$$

$$\text{s.t.} \quad (\mathbf{w}^T \mathbf{x}_i + b) \geq \rho \sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}} - \xi_i, \quad i = 1, \ldots, r_x,$$

$$-(\mathbf{w}^T \mathbf{y}_j + b) \geq \rho \sqrt{\mathbf{w}^T \Sigma_y \mathbf{w}} - \xi_{j+r_x}, \quad j = 1, \ldots, r_y,$$

$$\mathbf{w}^T (\bar{\mathbf{x}} - \bar{\mathbf{y}}) \geq \rho \Big( \sqrt{\mathbf{w}^T \Sigma_x \mathbf{w}} + \sqrt{\mathbf{w}^T \Sigma_y \mathbf{w}} \Big) - \xi_m,$$

$$\xi_m \geq 0, \quad \xi_k \geq 0, \quad k = 1, \ldots, r_x + r_y,$$

where $\xi_m$ is the slack variable for the global constraint Eq.(4.36), $\xi_k$ are the 
modified slack variables for the remaining data points, $r_x$ is the number of 
the remaining points for x, and $r_y$ is the number of the remaining points 
for y. 

Remarks. An interesting observation from the above is that when we set the 
reduction threshold $\nu$ to a larger value, or simply to the maximum value 1, the 
M^4 optimization degenerates to the standard MPM optimization. This would 
imply that the above modified M^4 model has the performance of MPM as its 
worst case, if the incorporated local information is useful. 


4.4 Kernelization 

One may note that in the above, the classifier derived from M^4 is provided in 
a linear configuration. In order to handle nonlinear classification problems, 
in this section, we seek to use the kernelization trick [18] to map the 
n-dimensional data points into a high-dimensional feature space $\mathbb{R}^f$, where a 
linear classifier corresponds to a nonlinear hyperplane in the original space. 

The kernel mapping can be formulated as: $\mathbf{x}_i \to \varphi(\mathbf{x}_i)$, $\mathbf{y}_j \to \varphi(\mathbf{y}_j)$, 
where $i = 1, \ldots, N_x$, $j = 1, \ldots, N_y$, and $\varphi : \mathbb{R}^n \to \mathbb{R}^f$ is a mapping function. 
The corresponding linear classifier in $\mathbb{R}^f$ is $\gamma^T \varphi(\mathbf{z}) = b$, where $\gamma, \varphi(\mathbf{z}) \in \mathbb{R}^f$, 
and $b \in \mathbb{R}$. 

The optimization of M^4 in the feature space can be written as: 

$$\max_{\rho,\, \gamma \neq \mathbf{0},\, b} \ \rho \tag{4.37}$$

$$\text{s.t.} \quad \frac{\gamma^T \varphi(\mathbf{x}_i) + b}{\sqrt{\gamma^T \Sigma_{\varphi(x)} \gamma}} \geq \rho, \quad i = 1, 2, \ldots, N_x, \tag{4.38}$$

$$\frac{-(\gamma^T \varphi(\mathbf{y}_j) + b)}{\sqrt{\gamma^T \Sigma_{\varphi(y)} \gamma}} \geq \rho, \quad j = 1, 2, \ldots, N_y. \tag{4.39}$$

However, to make the kernel work we need to represent the optimization and 
the final decision hyperplane in a kernel form, $K(\mathbf{z}_1, \mathbf{z}_2) = \varphi(\mathbf{z}_1)^T \varphi(\mathbf{z}_2)$, 
namely, an inner product form of the mapped data points. 

4.4.1 Foundation of Kernelization for M 4 

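In the following, we demonstrate that the kernelization trick indeed works in 
M^4, provided suitable estimates of means and covariance matrices are applied 
therein. 

Corollary 4.4. If the estimates of means and covariance matrices are given 
in M^4 as the following estimates: 

$$\overline{\varphi(\mathbf{x})} = \sum_{i=1}^{N_x} \lambda_i \varphi(\mathbf{x}_i), \qquad \overline{\varphi(\mathbf{y})} = \sum_{j=1}^{N_y} \omega_j \varphi(\mathbf{y}_j),$$

$$\Sigma_{\varphi(x)} = \rho_x I_n + \sum_{i=1}^{N_x} \Lambda_i \big( \varphi(\mathbf{x}_i) - \overline{\varphi(\mathbf{x})} \big) \big( \varphi(\mathbf{x}_i) - \overline{\varphi(\mathbf{x})} \big)^T,$$

$$\Sigma_{\varphi(y)} = \rho_y I_n + \sum_{j=1}^{N_y} \Omega_j \big( \varphi(\mathbf{y}_j) - \overline{\varphi(\mathbf{y})} \big) \big( \varphi(\mathbf{y}_j) - \overline{\varphi(\mathbf{y})} \big)^T,$$

where $I_n$ is the identity matrix of dimension n, then the optimal $\gamma$ in 
Eqs.(4.37)-(4.39) lies in the space spanned by the training points. 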

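Proof. We write $\gamma = \gamma_p + \gamma_d$, where $\gamma_p$ is the projection of $\gamma$ onto the vector 
space spanned by all the training data points and $\gamma_d$ is the component orthogonal 
to this span. By using $\gamma_d^T \varphi(\mathbf{x}_i) = 0$ and $\gamma_d^T \varphi(\mathbf{y}_j) = 0$, one can 
easily verify that the optimization Eqs.(4.37)-(4.39) changes to: 

$$\max_{\rho,\, \{\gamma_p, \gamma_d\} \neq \mathbf{0},\, b} \ \rho$$

$$\text{s.t.} \quad \frac{\gamma_p^T \varphi(\mathbf{x}_i) + b}{\sqrt{\gamma_p^T \sum_{l=1}^{N_x} \Lambda_l \big( \varphi(\mathbf{x}_l) - \overline{\varphi(\mathbf{x})} \big) \big( \varphi(\mathbf{x}_l) - \overline{\varphi(\mathbf{x})} \big)^T \gamma_p + \rho_x \big( \gamma_p^T \gamma_p + \gamma_d^T \gamma_d \big)}} \geq \rho,$$

$$\frac{-(\gamma_p^T \varphi(\mathbf{y}_j) + b)}{\sqrt{\gamma_p^T \sum_{l=1}^{N_y} \Omega_l \big( \varphi(\mathbf{y}_l) - \overline{\varphi(\mathbf{y})} \big) \big( \varphi(\mathbf{y}_l) - \overline{\varphi(\mathbf{y})} \big)^T \gamma_p + \rho_y \big( \gamma_p^T \gamma_p + \gamma_d^T \gamma_d \big)}} \geq \rho,$$

where $i = 1, \ldots, N_x$ and $j = 1, \ldots, N_y$. Since we intend to maximize the margin 
$\rho$, the denominators in the above two constraints need to be as small as 
possible. This would lead to $\gamma_d = \mathbf{0}$. In other words, the optimal $\gamma$ lies in 
the vector space spanned by all the training data points. Note that the above 
discussion takes place in the feature space. 

According to Corollary 4.4, if we use the plug-in estimates to approximate 
the means and covariance matrices, we can write $\gamma$ as a linear combination 
of the training data points: 

$$\gamma = \sum_{i=1}^{N_x} \mu_i \varphi(\mathbf{x}_i) + \sum_{j=1}^{N_y} \upsilon_j \varphi(\mathbf{y}_j), \tag{4.40}$$

where the coefficients $\mu_i, \upsilon_j \in \mathbb{R}$, $i = 1, \ldots, N_x$, $j = 1, \ldots, N_y$. 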


4.4.2 Kernelization Result 

We present the kernelization result as the following theorem. 

Theorem 4.5. [Kernelization Theorem of M^4] The optimal decision hyperplane 
for M^4 involves solving the following optimization problem: 

$$\max_{\rho,\, \eta \neq \mathbf{0},\, b} \ \rho$$

$$\text{s.t.} \quad \frac{\eta^T \mathbf{K}_i + b}{\sqrt{\frac{1}{N_x} \eta^T \tilde{\mathbf{K}}_x^T \tilde{\mathbf{K}}_x \eta}} \geq \rho, \quad i = 1, 2, \ldots, N_x,$$

$$\frac{-(\eta^T \mathbf{K}_{j+N_x} + b)}{\sqrt{\frac{1}{N_y} \eta^T \tilde{\mathbf{K}}_y^T \tilde{\mathbf{K}}_y \eta}} \geq \rho, \quad j = 1, 2, \ldots, N_y.$$

Proof. The theorem can easily be proved by simply substituting the plug-in 
estimates of the means and covariance matrices and Eq.(4.40) into Eqs.(4.38)-(4.39). 

The optimal decision hyperplane can be represented in a linear form in 
the kernel space: 

$$f(\mathbf{z}) = \sum_{i=1}^{N_x} \eta^*_i K(\mathbf{z}, \mathbf{x}_i) + \sum_{j=1}^{N_y} \eta^*_{N_x+j} K(\mathbf{z}, \mathbf{y}_j) + b^*,$$

where $\eta^*$ and $b^*$ are the optimal parameters obtained by the optimization 
procedure. The notations in the above are defined similarly to Chapter 3. However, 
for easy reference, we also summarize them in Table 4.1. 


Table 4.1. Notations used in Kernelization 
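
Once the coefficients are known, evaluating the kernelized decision function only requires kernel evaluations against the training points. The sketch below illustrates this with a Gaussian kernel; the kernel choice, its width parametrization, and the coefficient values are assumptions made for illustration and are not the output of an actual M^4 solver.

```python
import math

def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian kernel exp(-||u - v||^2 / sigma^2) (one common parametrization)."""
    sq = sum((a - c) ** 2 for a, c in zip(u, v))
    return math.exp(-sq / (sigma * sigma))

def decision_value(z, xs, ys, eta, b, kernel=gaussian_kernel):
    """Weighted sum of kernel values over all training points, plus b.

    eta holds one coefficient per training point, first those of the
    x data, then those of the y data.
    """
    train = list(xs) + list(ys)
    return sum(e * kernel(z, t) for e, t in zip(eta, train)) + b

xs = [(1.0, 0.0)]
ys = [(-1.0, 0.0)]
eta = [1.0, -1.0]  # hypothetical coefficients
b = 0.0

f_near_x = decision_value((0.9, 0.0), xs, ys, eta, b)
f_near_y = decision_value((-0.9, 0.0), xs, ys, eta, b)
print(f_near_x, f_near_y)
```

A point is assigned to the class x when the returned value is positive and to the class y otherwise, exactly as in the linear case.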
4.5 Experiments 

In this section, we present the evaluation results of M^4 in comparison with 
SVM and MPM on both synthetic toy datasets and real-world benchmark 
datasets. The SOCP problems are solved with the general-purpose software 
Sedumi [20, 21]. The covariance matrices are given by the plug-in estimates. 

4.5.1 Evaluations on Three Synthetic Toy Datasets 

We demonstrate the advantages of our approach in comparison with SVM 
and MPM in the following synthetic toy datasets first. 

As illustrated in Fig. 4.6, we generate two types of data with the same 
data orientations but different data magnitudes in Fig. 4.6 (a), while we gen¬ 
erate two types of data with the same data magnitudes but different data 
orientations in Fig. 4.6 (b). In (a), the x data are randomly sampled from 
the Gaussian distribution with the mean as [—3.5, 0] T and the covariance as 
[3, 0; 0, 4.5], while the y data are randomly sampled from another Gaussian 
distribution with the mean and the covariance as [3.5, 0] T and [1, 0; 0, 1.5] 
respectively. In (b), the x data are randomly sampled from the Gaussian dis¬ 
tribution with the mean as [—4, 0] T and the covariance as [1, 0; 0, 5], while 
the y data are randomly sampled from another distribution with the mean 
and the covariance as [4, 0] T and [1, 0; 0, 5] respectively. Moreover, to gener¬ 
ate different data orientation, in Fig. 4.6 the y data are rotated anti-clockwise 
at the angle of — g7r. In both (a) and (b), training (test) data consisting of 120 
(250) data points for each class are presented as o’s (+’s) and x’s (□’s)
for x and y respectively. As can be observed from Fig. 4.6, M^4 demonstrates
its advantages over SVM. More specifically, in Fig. 4.6 (a), SVM discards
the information about the data magnitudes: its decision hyperplane lies
basically in the middle of the boundary points of the two types of data.
M^4, in contrast, successfully utilizes this information, i.e. its decision
hyperplane lies closer to the compact class (the y data), which is more
reasonable. Similarly, in Fig. 4.6 (b), M^4 takes advantage of the
information about the data orientation, while SVM simply overlooks this
information, which results in many incorrectly classified points.
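
As a rough illustration of how a dataset like the one in Fig. 4.6 (a) could
be generated and how plug-in estimates are formed, the following sketch
samples from the two Gaussians described above using only the Python
standard library. The function names, the random seed, and the use of the
unbiased (n - 1) covariance estimator are assumptions made for this
illustration; the text does not state which estimator is used. Because both
covariance matrices are diagonal, each coordinate is drawn independently.

```python
import random

def sample_diag_gaussian(mean, variances, n, rng):
    """Draw n points from a Gaussian with a diagonal covariance matrix."""
    return [[rng.gauss(m, v ** 0.5) for m, v in zip(mean, variances)]
            for _ in range(n)]

def plug_in_estimates(points):
    """Return the sample mean and the (unbiased) sample covariance matrix."""
    n, d = len(points), len(points[0])
    mean = [sum(p[i] for p in points) / n for i in range(d)]
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov

rng = random.Random(0)  # fixed seed, chosen only for reproducibility

# Dataset (a): same orientation, different magnitudes, 120 training points per class.
x_train = sample_diag_gaussian([-3.5, 0.0], [3.0, 4.5], 120, rng)
y_train = sample_diag_gaussian([3.5, 0.0], [1.0, 1.5], 120, rng)

x_mean, x_cov = plug_in_estimates(x_train)
y_mean, y_cov = plug_in_estimates(y_train)
```

With 120 points per class the estimated means and covariances are usually
close to the true values, which matches the observation that MPM and M^4
behave similarly on the first two toy datasets.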

Comparing MPM with M^4, the two achieve similar performance on the above two
datasets, since the global information, i.e. the mean and the covariance,
can be reliably estimated from the data. To see the difference between M^4
and MPM, we generate another dataset, illustrated in Fig. 4.7, where we
intentionally generate a very small number of training data, i.e. only 20
training points for each class. As before, the data are generated from two
Gaussian distributions: the x data are randomly sampled from a Gaussian
distribution with mean [-3, 0]^T and covariance [0.5, 0; 0, 8], while the
y data are randomly sampled from another distribution with mean [4, 0]^T
and covariance [6, 0; 0, 1]. Training data and test data are represented
using symbols similar to those in Fig. 4.6. From Fig. 4.7, once again
M^4 achieves an ideal decision boundary which considers data both locally and


Fig. 4.6. The first two synthetic toy examples to illustrate M^4. Training
(test) data consisting of 120 (250) data points for each class are presented as
o’s (+’s) and x’s (□’s) for x and y respectively. Subfigure (a) demonstrates
that SVM omits the data compactness information and (b) demonstrates
that SVM discards the data orientation information, while M^4 achieves
an ideal decision boundary which considers data both locally and globally


globally, whereas SVM obtains a local boundary just in the middle of the
support vectors, which discards the global information, namely the
statistical “trend” of data occurrence. For MPM, the decision hyperplane
depends exclusively on the mean and covariance matrices. Thus we can see
that this hyperplane coincides with the data shape, i.e. the long axis of
the training data of x is nearly in the same direction as the MPM decision
hyperplane. However, the estimated mean and covariance are inaccurate due to
the small number of data points. This results in a relatively lower test
accuracy, as illustrated in Fig. 4.7 (b). In comparison, M^4 incorporates
the information of the local points to neutralize the effect caused by
inaccurate estimations. The test accuracies for the above three toy
datasets, listed in Table 4.2, also demonstrate the advantages of M^4.


Fig. 4.7. The third synthetic toy example to illustrate M^4. Training (test)
data consisting of 20 (60) data points for each class are presented as o’s
(+’s) and x’s (□’s) for x and y respectively. Subfigure (a) demonstrates
the decision boundaries derived from the training data, while (b) illustrates
the performance of these hyperplanes on the test set. M^4 achieves an ideal
decision boundary which considers data both locally and globally

4.5.2 Evaluations on Benchmark Datasets 

Table 4.2. Comparisons of classification accuracies between M^4, SVM,
and MPM on the toy datasets

Dataset    Classification accuracy (%)
           M^4     SVM     MPM
I          98.8    96.8    98.8
II         98.8    97.2    98.8
III        98.3    97.5    95.8


We perform evaluations on seven standard datasets. Data for the Twonorm
problem are synthetically generated according to [3]. The remaining six
datasets are real-world data obtained from the UCI machine learning
repository [2]. We compared M^4 with SVM and MPM using both the linear and
Gaussian kernels. The parameter C for both M^4 and SVM was tuned via
cross validation [9], as was the width parameter of the Gaussian kernel for
all three models. The final performance results were obtained via 10-fold
cross validation. Table 4.3 summarizes the evaluation results.
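
The 10-fold cross validation protocol mentioned above can be sketched as
follows. This is only an illustration under stated assumptions: the helper
names `k_fold_indices` and `summarize`, the dataset size, and the placeholder
accuracies are hypothetical, and the text does not say whether the "±"
values reported in Table 4.3 are standard deviations, so the use of the
sample standard deviation here is an assumption.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

def summarize(fold_accuracies):
    """Mean and sample standard deviation of the per-fold accuracies."""
    return statistics.mean(fold_accuracies), statistics.stdev(fold_accuracies)

# Hypothetical dataset size; each split would be used to train and test a classifier.
splits = list(k_fold_indices(n=100, k=10))

# Placeholder per-fold accuracies, not results from the book.
fold_acc = [0.85, 0.88, 0.86, 0.87, 0.84, 0.89, 0.86, 0.85, 0.88, 0.87]
mean_acc, std_acc = summarize(fold_acc)
```

Every index appears in exactly one test fold, so each data point is used for
testing once and for training in the remaining nine rounds.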


Table 4.3. Comparisons of classification accuracies (%) among M^4, SVM, and MPM

                 Linear kernel                             Gaussian kernel
Dataset          M^4          SVM          MPM          M^4          SVM          MPM
Twonorm          96.5 ± 0.6   95.1 ± 0.7   97.6 ± 0.5   96.5 ± 0.7   96.1 ± 0.4   97.6 ± 0.5
Breast           97.5 ± 0.7   96.6 ± 0.5   96.9 ± 0.8   97.5 ± 0.6   96.7 ± 0.4   96.9 ± 0.8
Ionosphere       87.7 ± 0.8   86.9 ± 0.6   84.8 ± 0.8   94.5 ± 0.4   94.2 ± 0.3   92.3 ± 0.6
Pima             77.7 ± 0.9   77.9 ± 0.7   76.1 ± 1.2   77.6 ± 0.8   78.0 ± 0.5   76.2 ± 1.2
Sonar            77.6 ± 1.2   76.2 ± 1.1   75.5 ± 1.1   84.9 ± 1.2   86.5 ± 1.1   87.3 ± 0.8
Vote             96.1 ± 0.5   95.1 ± 0.4   94.8 ± 0.4   96.2 ± 0.5   95.9 ± 0.6   94.6 ± 0.4
Heart-disease    86.6 ± 0.8   84.1 ± 0.7   83.2 ± 0.8   86.2 ± 0.8   83.8 ± 0.5   83.1 ± 1.0


From the results we observe that M^4 achieves the best overall performance.
In comparison with SVM and MPM, M^4 wins five cases with the linear
kernel and four with the Gaussian kernel. The evaluations on these standard
benchmark datasets demonstrate that it is worth considering data both
locally and globally, as M^4 does. Inspecting the differences
between M^4 and SVM, the kernelized M^4 appears only marginally better than
the kernelized SVM, while the linear M^4 demonstrates a distinct advantage
over the linear SVM. This phenomenon may be explained in two ways.
On the one hand, the data points are very sparse in the kernelized space or
feature space (compared with the huge dimensionality induced by the Gaussian
kernel). Thus the plug-in estimates of the covariance matrices may not
accurately represent the data information in this case. On the other hand,
it is well known that kernelization does not preserve the structure
information in the feature space. One direct consequence is that maximizing
the margin in the feature space does not necessarily maximize the margin in
the original space [23]. Therefore, without building some connections
between the original space and the feature space, utilizing the structure
information, e.g. covariance matrices, in the feature space seems to be of
little help in this sense. In view of these two points, one interesting
topic for future work is to impose constraints on the mapping function
so as to maintain the data topology in the kernelization process.

In the above, we do not perform the reduction on these datasets. To
illustrate how the reduction algorithm decreases the computation time
while maintaining the test accuracy, we apply it to the Heart-disease
dataset. We perform the reduction on the training sets and keep the test
sets unchanged. We repeat this process for different thresholds v, and then
plot both the cross validation accuracy and the computation time against
the threshold v, as shown in Fig. 4.8. From this figure, we can see that
both the computation time and the test accuracy are insensitive to v when v
is set to small values, e.g. v < 0.7. Looking into the Heart-disease dataset,
we find that most data points are far away from their corresponding class
center in terms of the Mahalanobis distance. Thus setting v to small values
does not actually remove many data points, which yields relatively flat
curves for both the test accuracy and the computation time in this
range. As v becomes larger, the computation time decreases quickly as more
and more data points are removed, while the test accuracy goes down slowly.
When the threshold is set to 1, M^4 degrades to the MPM model, so the test
accuracy of M^4 equals that of MPM. This demonstrates
how the proposed reduction algorithm can decrease the computation time
while maintaining good performance. In practice, the threshold
can be set according to the required response time.



Fig. 4.8. Reduction on the Heart-disease dataset. (a) Test accuracy vs.
threshold (v). (b) Running time vs. threshold (v)


4.6 Discussions and Future Work

We discuss several important issues in this section. First, although M^4
can be solved in polynomial time, its large computation time is still one of
its biggest limitations. This may cause problems, especially in its
kernelized version. Note that the reduction algorithm proposed in this
chapter does not completely solve this problem, since removing points
inevitably loses information. In this sense, it is crucial to develop
special algorithms for M^4. Due to the sparsity of M^4 (it also contains
support vectors), it is very interesting to investigate whether
decomposition methods or an analogy to the Sequential Minimal Optimization
[16] designed for SVM can also be applied in training M^4. We believe that
there is much to gain from such explorations, and this is a worthy research
direction for the future.

Second, although we have derived an error bound for M^4, establishing
a direct connection, or performing an empirical comparison, between this
bound and those of its special cases, namely SVM and MPM, remains
an interesting problem. In particular, it is an open problem whether there
exists a unified form of the bounds for M^4, SVM, and MPM. This subject
deserves in-depth exploration in the future.

Third, since this chapter mainly discusses M^4 for two-category
classification, extending it to multi-way classification is also
an important topic for future work.


4.7 Summary 

Local learning approaches, e.g. large margin machines, have demonstrated
their advantages in machine learning and pattern recognition. However, they
derive the decision boundary only in a local way. For example, the most
popular large margin classifier, the Support Vector Machine, obtains the
decision hyperplane by focusing on some critical local points called support
vectors, while discarding all other points. On the other hand, global
learning models (e.g. the Minimax Probability Machine) obtain the classifier
based only on global information, i.e. the mean and covariance information
in MPM, while ignoring all individual local points. In contrast, our
proposed model is constructed based on both a local and a global view of the
data. This new model is theoretically important in the sense that SVM and
MPM can both be considered as its special cases. Furthermore, the
optimization of M^4 can be cast as a sequential Conic Programming problem
which can be solved in polynomial time.

We have provided a clear geometrical interpretation, and established
detailed connections between our model and other models such as the Support
Vector Machine, the Minimax Probability Machine, Fisher Discriminant
Analysis, and the Minimum Error Minimax Probability Machine. We have also
shown how to exploit Mercer kernels to extend our model to nonlinear
decision boundaries. In addition, we have proposed a reduction method to
decrease the computation time. Experimental results on both synthetic
datasets and real-world benchmark datasets have demonstrated the advantages
of M^4 over the Support Vector Machine and the Minimax Probability Machine.


Extension I: BMPM for Imbalanced Learning


In this chapter, we consider the imbalanced learning problem, i.e.
the task of binary classification on imbalanced data, in which nearly
all the instances are labeled as one class, while far fewer instances are
labeled as the other class, usually the more important class. Traditional
machine learning methods seeking accurate performance over a full range of
instances are not suitable for this problem, since they tend to classify all
the data into the majority class, usually the less important class. Moreover,
many current methods try to utilize some intermediate factors, e.g.
the distribution of the training set, the decision thresholds, or the cost
matrix, to impose a bias towards the important class. However, it remains
uncertain whether these roundabout methods can improve the performance in a
systematic way. In this chapter, we apply the Biased Minimax Probability
Machine (BMPM), one of the special cases of the Minimum Error Minimax
Probability Machine, to imbalanced learning tasks. Different from previous
methods, this model derives the biased classifier in a worst-case scenario
by directly controlling the classification accuracy on each class. More
precisely, BMPM builds up an explicit connection between the classification
accuracy and the bias, which thus provides a rigorous treatment of
imbalanced data. We examine different models and compare BMPM with three
other competitive methods, i.e. the Naive Bayesian classifier, the k-Nearest
Neighbor method, and the decision tree method C4.5. The experimental results
demonstrate the superiority of this model.

This chapter is organized as follows. In the next section, we briefly
introduce imbalanced learning. We then restate concisely the theoretical
foundation of this chapter, namely the BMPM model. Following that, in
Section 5.3 we apply the BMPM model to imbalanced learning tasks. In
Section 5.4, we evaluate the BMPM model through a series of experiments, and
in Section 5.5, we discuss the results and present future work. Finally, we
summarize this chapter in Section 5.6.


5.1 Introduction to Imbalanced Learning

Learning classifiers from imbalanced or skewed datasets is an important topic, 
arising very often in practice in classification problems. In such problems, 
almost all the instances are labeled as one class, while far fewer instances 
are labeled as the other class, usually the more important class. It is obvious 
that traditional classifiers seeking accurate performance over a full range of 
instances are not suitable to deal with imbalanced learning tasks, since they 
tend to classify all the data into the majority class, which is usually the less 
important class. 

To cope with imbalanced datasets, there are three types of methods:
the methods of sampling [4, 22, 15], the methods of moving the decision
thresholds [26, 29], and the methods of adjusting the cost matrix [3, 26].
The first school of methods aims to reduce the data imbalance by
“down-sampling” (removing) instances from the majority class,
“up-sampling” (duplicating) the training instances from the minority class,
or both. The second school of methods tries to adapt the decision threshold
to impose a bias towards the minority class. Similarly, the third school of
methods improves the prediction performance by adjusting the weight (cost)
for each class.
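
The two sampling strategies can be sketched in a few lines. The sketch
below is illustrative only: the function names are hypothetical, and
balancing the classes to exactly equal sizes is an assumption made here for
simplicity, since choosing the sampling proportion is precisely the
difficulty with these methods.

```python
import random

def down_sample(majority, minority, rng):
    """Randomly remove majority instances until both classes have the same size."""
    return rng.sample(majority, len(minority)), list(minority)

def up_sample(majority, minority, rng):
    """Randomly duplicate minority instances until both classes have the same size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra

rng = random.Random(0)            # fixed seed, only for reproducibility
majority = list(range(65))        # placeholder instances of the majority class
minority = list(range(100, 115))  # placeholder instances of the minority class

maj_down, min_down = down_sample(majority, minority, rng)
maj_up, min_up = up_sample(majority, minority, rng)
```

Down-sampling discards 50 of the 65 majority instances in this example,
while up-sampling adds 50 duplicated minority instances and no new
information.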

A common problem for all three families of methods is that they lack
a rigorous and systematic treatment of imbalanced data. For the sampling
methods, both up- and down-sampling have drawbacks: up-sampling introduces
noise, while down-sampling loses information. Moreover, to incorporate a
good bias, it is usually difficult to know what proportion should be
sampled. For these reasons, Provost stated it as an open problem
whether simply varying the skewness of the data distribution can improve
prediction performance systematically [29]. For the methods of adjusting the
cost matrix or adapting weights, similar problems are encountered, i.e.
it is hard to build direct quantitative connections between the cost matrix
or the weights and the biased classification. To impose a suitable
bias towards the important class, these factors have to be adapted by trial
and error. Therefore, these methods cannot rigorously handle imbalanced data.

In this chapter, we apply Biased Minimax Probability Machine (BMPM) 
to handle the tasks of learning from imbalanced data. Different from the sam¬ 
pling methods, BMPM does not remove or duplicate data. When compared 
with the methods of changing the thresholds or weights, our model builds 
up an explicit connection between the classification accuracy and the bias. 
It thus offers an elegant way to incorporate the bias into classification by 
directly controlling the real accuracy. 


5.2 Biased Minimax Probability Machine 

Suppose two random n-dimensional vectors x and y represent two classes of
data, where x belongs to the family of distributions with a given mean
$\bar{x}$ and a covariance $\Sigma_x$, denoted as
$x \sim (\bar{x}, \Sigma_x)$; similarly, y belongs to the family of
distributions with a given mean $\bar{y}$ and a covariance $\Sigma_y$,
denoted as $y \sim (\bar{y}, \Sigma_y)$. Here
$x, y, \bar{x}, \bar{y} \in \mathbb{R}^n$, and
$\Sigma_x, \Sigma_y \in \mathbb{R}^{n \times n}$. In this chapter,
the class x also represents the important or minority class and the class y
represents the corresponding less important or majority class.

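The Biased Minimax Probability Machine can be described as follows¹:

$$
\begin{aligned}
\max_{\alpha,\,\beta,\,b,\,w \neq 0} \quad & \alpha \\
\text{s.t.} \quad & \inf_{x \sim (\bar{x}, \Sigma_x)} \Pr\{w^T x \ge b\} \ge \alpha , && (5.1) \\
& \inf_{y \sim (\bar{y}, \Sigma_y)} \Pr\{w^T y \le b\} \ge \beta , && (5.2) \\
& \beta \ge \beta_0 . && (5.3)
\end{aligned}
$$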


Here $\alpha$ is the lower bound of the probability (accuracy) for the
classification of future cases of the class x with respect to all
distributions with the mean and covariance $(\bar{x}, \Sigma_x)$; in other
words, $\alpha$ is the worst-case accuracy for the class x. Similarly,
$\beta$ is the lower bound of the accuracy of the class y. This optimization
maximizes the accuracy (the probability $\alpha$) for the biased class x
while simultaneously maintaining the class y’s accuracy at an acceptable
level $\beta_0$ by setting a lower bound as in Eq. (5.3). In comparison,
the Minimax Probability Machine (MPM) in [16, 17] considers the balanced
dataset; therefore, it makes $\alpha$ equal to $\beta$.
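
For a fixed hyperplane, the worst-case accuracy in constraint (5.1) has a
closed form. By the one-sided Chebyshev (Cantelli) inequality, which is
tight over all distributions with a given mean and covariance, the infimum
equals d^2 / (1 + d^2), where d is the distance from the projected mean to
the threshold b measured in units of the projected standard deviation, and
it is 0 when the mean lies on the wrong side of the hyperplane. The sketch
below evaluates this bound; the function name and the example numbers are
illustrative assumptions, and the covariance is assumed to be positive
definite so that the projected variance is nonzero.

```python
def worst_case_accuracy(w, b, mean, cov, positive_side=True):
    """Lower bound on Pr{w^T z >= b} (or Pr{w^T z <= b} if positive_side is
    False) over all distributions with the given mean and covariance."""
    n = len(w)
    proj_mean = sum(w[i] * mean[i] for i in range(n))
    proj_var = sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n))
    margin = proj_mean - b if positive_side else b - proj_mean
    if margin <= 0:
        return 0.0  # the mean is on the wrong side: no guarantee is possible
    d2 = margin ** 2 / proj_var
    return d2 / (1.0 + d2)

identity = [[1.0, 0.0], [0.0, 1.0]]
w, b = [1.0, 0.0], 0.0

# Class x has its mean 2 standard deviations inside its half-space: alpha = 4/5.
alpha = worst_case_accuracy(w, b, [2.0, 0.0], identity)
# Class y has its mean 1 standard deviation inside its half-space: beta = 1/2.
beta = worst_case_accuracy(w, b, [-1.0, 0.0], identity, positive_side=False)
```

Moving b towards the class y mean raises alpha and lowers beta, which is the
trade-off that the constraint $\beta \ge \beta_0$ controls.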

This optimization setting seems to be more useful for incorporating a bias
into classification in imbalanced learning problems. A typical example is
the epidemic disease diagnosis problem, which is usually an imbalanced
classification problem as well. The “ill” cases are usually much fewer
than the healthy cases. However, misclassification of the “ill” class has
more serious consequences than misclassification of the “healthy” class.
Thus an unequal treatment of the different classes is obviously necessary.

We summarize the advantages of our biased model as follows. First,
this method provides a different treatment of different classes, i.e. the
hyperplane $w_*^T z = b_*$ given by the solution of this optimization favors
the classification of the important class x over the less important class y.
Second, given reliable mean and covariance matrices, the derived decision
hyperplane is directly associated with two real accuracy indicators, i.e.
$\alpha$ and $\beta$, one for each class. Thus, by varying the lower bound
of $\beta$, i.e. $\beta_0$, and deriving the corresponding classifier, we
can quantitatively incorporate a bias into the classification. Third, this
model is distribution-free. With no distribution assumption on the data, the
derived hyperplane seems to be more general and valid than a large family of
classifiers, namely the generative classifiers [10, 12], including the Naive
Bayesian classifier [18], which have to make specific distribution
assumptions. Fourth, as shown shortly in Section 5.3, we can either simply
modify this BMPM optimization to automatically search for the best $\beta_0$
in terms of some standard criteria or, slightly differently from the current
setting, quantitatively generate the trade-off curve between the accuracies
on different classes and leave the task of choosing the best $\beta_0$ to
the users. Finally, BMPM does not trade efficiency for the above advantages.
It is shown shortly that the optimization of BMPM can be cast as a
Fractional Programming (FP) problem and thus can be solved efficiently. In
short, with these important features, BMPM appears to offer a more direct
and rigorous scheme to handle biased classification tasks, especially the
imbalanced classifications, where the importance or cost for each class is
unequal.

¹ Note that, for ease of explanation, the model description is in a slightly
different but essentially the same form as the one introduced in Chapter 3.


5.3 Learning from Imbalanced Data by Using BMPM 

In this section, we apply the novel BMPM model to the tasks of learning from 
imbalanced data. We first review four standard imbalanced learning criteria, 
then based on two of them, we apply BMPM to the imbalanced learning 
tasks. 

5.3.1 Four Criteria to Evaluate Learning from Imbalanced Data 

In general, four criteria are used to evaluate imbalanced learning. They
are (1) the criterion of Minimum Cost (MC), (2) the criterion of Maximum
Geometric Mean (MGM) of the accuracies on the majority class and the
minority class, (3) the criterion of the Maximum Sum (MS) of the accuracies
on the majority class and the minority class, and (4) the criterion of Receiver
Operating Characteristic (ROC) analysis. We review these criteria as follows.

To avoid the problems caused by maximizing the accuracy over the
full range of data, Grzymala-Busse et al. [9] instead maximized the sum of
the accuracies on the minority class and the majority class (or,
equivalently, maximized the difference between the true positive rate and
the false positive rate). This criterion is also widely used in other
fields, e.g. graph detection, especially line detection and arc detection,
where it is called the Vector Recovery Index [6, 23].
Similarly, Kubat et al. [14] proposed to use the geometric mean instead
of the sum of the accuracies. However, compared with the sum,
this criterion has a nonlinear form, which is not easy to optimize
automatically. On the other hand, when the cost of misclassification is
known, a minimum cost measure defined as Eq. (5.4) should be used [2]:

$$\text{Cost} = F_p \cdot C_{F_p} + F_n \cdot C_{F_n} , \quad (5.4)$$

where $F_p$ is the number of false positives, $C_{F_p}$ is the cost of a
false positive, $F_n$ is the number of false negatives, and $C_{F_n}$ is the
cost of a false negative. However, because the cost of misclassification is
generally unknown in real cases, the usage of this measure is somewhat
restricted. Considering this point, some researchers introduced the ROC
analysis [25, 26, 34]. This criterion plots a so-called ROC curve to
visualize the tradeoff between the false positive rate and the true positive
rate and leaves the selection of a specific tradeoff to the practitioners.
Fig. 5.1 illustrates an artificially generated ROC curve. It has been
suggested that the area beneath an ROC curve can be used as a measure of
accuracy in many applications [30, 33].
Thus, a good classifier for imbalanced learning should have a larger area.



Fig. 5.1. An artificially generated Receiver Operating Characteristic
(ROC) curve (horizontal axis: false positive rate)


Based on the above review, in this chapter we will focus on using the 
criterion of MS and the ROC curve analysis to evaluate the classifiers. 
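
The first three criteria reviewed above can all be computed from the four
entries of a confusion matrix. The following sketch is a minimal
illustration; the function name, the dictionary keys, the example counts,
and the costs are hypothetical. It also makes explicit why maximizing the
sum of the two accuracies is the same as maximizing the difference between
the true positive rate and the false positive rate: the two quantities
differ by exactly 1.

```python
import math

def imbalance_criteria(tp, fn, tn, fp, cost_fp=1.0, cost_fn=1.0):
    """Evaluate the MS, MGM and MC criteria from confusion-matrix counts."""
    tpr = tp / (tp + fn)  # accuracy on the positive (minority) class
    tnr = tn / (tn + fp)  # accuracy on the negative (majority) class
    fpr = fp / (tn + fp)  # false positive rate, equal to 1 - tnr
    return {
        "sum": tpr + tnr,                        # Maximum Sum criterion
        "geometric_mean": math.sqrt(tpr * tnr),  # Maximum Geometric Mean criterion
        "cost": fp * cost_fp + fn * cost_fn,     # Eq. (5.4)
        "tpr_minus_fpr": tpr - fpr,
    }

# Hypothetical counts: 15 positives (14 found) and 65 negatives (62 found);
# a missed positive is assumed to cost five times as much as a false alarm.
result = imbalance_criteria(tp=14, fn=1, tn=62, fp=3, cost_fp=1.0, cost_fn=5.0)
```

Here the cost is 3 * 1 + 1 * 5 = 8, and `result["sum"]` equals
`1 + result["tpr_minus_fpr"]` up to rounding.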

5.3.2 BMPM for Maximizing the Sum of the Accuracies 

In the following, we first modify the formulation of BMPM to maximize the
sum of the accuracies for the two classes. Next, we analyze the
solvability of the modified version. Finally, we present the optimization
method.

5.3.2.1 Model Modification 

When using BMPM for the criterion of MS, we can modify the formulation
of BMPM as follows:

$$
\begin{aligned}
\max_{\alpha,\,\beta,\,b,\,w \neq 0} \quad & (\alpha + \beta) , && (5.5) \\
\text{s.t.} \quad & \inf_{x \sim (\bar{x}, \Sigma_x)} \Pr\{w^T x \ge b\} \ge \alpha , && (5.6) \\
& \inf_{y \sim (\bar{y}, \Sigma_y)} \Pr\{w^T y \le b\} \ge \beta . && (5.7)
\end{aligned}
$$

The above formulation directly maximizes the sum of the lower bounds of the
accuracies so as to maximize the sum of the accuracies. In comparison, to
achieve the maximum sum of the accuracies, other approaches, e.g. the
methods of sampling or the methods of adapting the weights, have to search
for the best sampling proportion or the best weights by trial and error,
which is in general very time-consuming. Since the above optimization is in
fact nearly the same as the Minimum Error Minimax Probability Machine, it
can be similarly solved by the Sequential Biased Minimax Probability Machine
optimization method introduced in Chapter 3. We thus do not elaborate on it
here.


5.3.3 BMPM for ROC Analysis 

It is straightforward to apply the BMPM model to plot the ROC curve, since
the lower bounds $\alpha$ and $\beta$ directly and quantitatively control
the accuracies for the two classes. We only need to vary the acceptable
level for $\beta$, namely $\beta_0$, from 0 to 1 to obtain a sequence of
trade-offs between the accuracies of the important class and the negative
(less important) class. We emphasize again that, since $\beta_0$ represents
the lower bound of the accuracy of the less important class, varying
$\beta_0$ provides a direct and quantitative way to move the decision plane
with different trade-offs. Directly associating accuracies with the movement
of the hyperplane while assuming no distribution is one of the advantages of
BMPM over the other methods, which adapt the weights or thresholds.


5.4 Experimental Results 

In this section, we first illustrate the BMPM model with a toy example,
and then evaluate the performance of BMPM on two real-world imbalanced
datasets, namely the recidivism dataset and the rooftop dataset, in
comparison with the Naive Bayesian (NB) classifier, the k-Nearest Neighbor
(k-NN) method [1], and the decision tree classifier C4.5 [31].


5.4.1 A Toy Example 

We present a toy example to illustrate the BMPM model in this section.
Suppose 15 data points of the class x are generated from a 2D Gaussian
distribution with the mean and covariance matrix $\bar{x} = [0, 1.5]^T$ and
$\Sigma_x = [0.5, 0; 0, 0.5]$, and 65 data points of the class y from
another 2D Gaussian distribution with $\bar{y} = [0, 0]^T$ and
$\Sigma_y = [0.5, 0; 0, 0.5]$.

By varying the lower bound accuracy $\beta_0$ for the class y and optimizing
the corresponding BMPM, we obtain a series of decision boundaries for the
toy example when using the Gaussian kernel e - ^ -3 ^ with the parameter a 
as 5. These boundaries are illustrated in Fig. 5.2. Gray regions are
classified as the class x, represented by +’s, whereas those outside the
gray regions are judged as the class y, plotted as □’s. It is clear that the
lower bound $\beta_0$ directly controls the accuracy of the class y. More
specifically, when $\beta_0$ is set to smaller values such as 10.00%, 60.00%
and 95.00%, the boundary is biased towards the class x. When $\beta_0$ is
set to a larger value such as 99.00%, the classification is biased towards
the class y. Moreover, Table 5.1 demonstrates that the lower bounds
$\beta_0$ and $\alpha$ can serve as accuracy indicators. It is observed that
these lower bounds hold well, i.e. the corresponding accuracies are slightly
higher than the lower bounds, except in the case when $\beta_0$ = 0.95. The
exception, i.e. that the value of $\alpha$, 99.16%, is greater than the real
accuracy, 93.33%, is understandable due to the relatively small number of
training samples: one single misclassification influences the classification
results significantly. This toy example demonstrates that by changing
$\beta_0$, BMPM provides an elegant and direct way to incorporate the bias
into the classification.


Fig. 5.2. A toy example to illustrate BMPM, with panels (a) γ = 0.10,
(b) γ = 0.60, (c) γ = 0.95 and (d) γ = 0.99. Data of the class x are plotted
as +’s, and data of the class y as □’s. The gray area represents the
classification region of the class x, while the area outside the gray region
is classified as the class y


Table 5.1. Lower bounds of accuracies, $\alpha$ and $\beta_0$, and the real
accuracies

β0 (%)    True negative rate (%)    α (%)     True positive rate (%)
10.00     13.85                     100.00    100.00
60.00     63.08                     100.00    100.00
95.00     95.38                     99.16     93.33
99.00     100.00                    81.94     86.67
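
Each setting of the lower bound for the class y yields one operating point,
so the rates reported in Table 5.1 can be turned into a coarse ROC curve.
The sketch below converts the (true negative rate, true positive rate) pairs
from that table into ROC points and estimates the area under the curve with
the trapezoidal rule. The function names are hypothetical, anchoring the
curve at (0, 0) and (1, 1) is an assumption of this illustration, and with
only four operating points the area is a rough approximation.

```python
def roc_points(tnr_tpr_pairs):
    """Convert (true negative rate, true positive rate) pairs into sorted ROC
    points (false positive rate, true positive rate), anchored at the corners."""
    points = {(0.0, 0.0), (1.0, 1.0)}
    points.update((1.0 - tnr, tpr) for tnr, tpr in tnr_tpr_pairs)
    return sorted(points)

def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear curve through sorted points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Empirical rates from Table 5.1, written as fractions:
# (true negative rate, true positive rate) for lower bounds 0.10, 0.60, 0.95, 0.99.
table_5_1 = [(0.1385, 1.0), (0.6308, 1.0), (0.9538, 0.9333), (1.0, 0.8667)]

points = roc_points(table_5_1)
auc = area_under_curve(points)
```

A finer sweep of the lower bound between 0 and 1 would add more operating
points and give a smoother curve.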

5.4.2 Evaluations on Real World Imbalanced Datasets 

In this section, we evaluate our BMPM model in comparison with three
competitive classification methods, namely the Naive Bayesian classifier, the
k-Nearest Neighbor method and the decision tree C4.5, on two real-world
imbalanced datasets, the recidivism dataset and the rooftop dataset. Before
we go into the experimental details, we first introduce these three techniques
and adapt them to learn from imbalanced datasets according to previous
research results [20, 26].

5.4.2.1 Modifying Three Learning Techniques 

We investigate and modify three learning techniques, the Naive Bayesian
classifier, the k-Nearest Neighbor method, and the decision tree C4.5, in the
following.

The Naive Bayesian classifier [11, 18] is based on a very simple
assumption, i.e. the attributes are conditionally independent of each
other given the class variable. The decision in a two-category prediction
task is made according to the posterior probability $p(C|z)$, where C is the
class variable and z represents the observation. When $p(C_1|z) > 0.5$ or
another equivalent yet more convenient rule is satisfied, i.e.
$p(C_1)p(z|C_1) > p(C_2)p(z|C_2)$, z is classified into $C_1$; otherwise, it
is judged as $C_2$. Even with the strong conditional independence assumption,
the Naive Bayesian classifier demonstrates a surprisingly good performance
when compared with state-of-the-art classifiers [8, 19] such as Support Vector
Machines [35] and C4.5 in many domains. By simply introducing a parameter
$\tau$ into the decision rule, $p(C_1)p(z|C_1) > \tau\, p(C_2)p(z|C_2)$,
Naive Bayesian classifiers can be adapted to imbalanced learning. For
example, specifying $\tau < 1$ imposes a bias towards the $C_1$ class,
whereas specifying $\tau > 1$ imposes a bias towards the $C_2$ class.
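
The biased decision rule can be written down directly. In the sketch below,
the function names and all probabilities are hypothetical; the class
likelihood is formed as the product of per-attribute likelihoods, which is
what the conditional independence assumption amounts to.

```python
import math

def naive_likelihood(attribute_likelihoods):
    """p(z|C) under the naive assumption: the product of per-attribute likelihoods."""
    return math.prod(attribute_likelihoods)

def biased_decision(prior1, lik1, prior2, lik2, tau=1.0):
    """Return 1 if p(C1)p(z|C1) > tau * p(C2)p(z|C2), otherwise 2."""
    return 1 if prior1 * lik1 > tau * prior2 * lik2 else 2

# Hypothetical minority class C1 (prior 0.2) and majority class C2 (prior 0.8).
lik1 = naive_likelihood([0.5, 0.6])  # p(z|C1)
lik2 = naive_likelihood([0.2, 0.5])  # p(z|C2)

unbiased = biased_decision(0.2, lik1, 0.8, lik2)         # tau = 1: class 2
biased = biased_decision(0.2, lik1, 0.8, lik2, tau=0.5)  # tau < 1: class 1
```

With tau = 1 the larger prior of the majority class decides the outcome;
lowering tau to 0.5 shifts the same observation to the minority class.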

In the k-Nearest Neighbor classification [1], based on some distance
measure, e.g. the Euclidean distance, the k data points closest
to the query point are selected. The query point is then labeled as
the most frequent class among the chosen k points. Although this method is
very simple and may suffer from difficulties in high dimensions, it achieves
satisfactory performance in many real domains. Following [26], we alter the
distance measure $\delta_j$ for the class $C_j$ to handle imbalanced learning
tasks according to Eq. (5.8):


$$\delta_j = d_E(z, z_j) - \tau_j\, d_E(z, z_j) , \quad (5.8)$$

where $z_j$ is the closest point from class $C_j$ to the query point z, and
$d_E(z, z_j)$ represents the Euclidean distance. Similar to the Naive
Bayesian classifier, by modifying $\tau_j$ the Nearest Neighbor method can
build biased classifiers.
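
A minimal sketch of the biased distance in Eq. (5.8) is given below for the
nearest-neighbor case. The function name, the class labels, and the data
points are hypothetical, and the text does not spell out how the adjustment
is combined with voting when k > 1, so only the single closest point of each
class is considered here.

```python
import math

def biased_nearest_class(query, points_by_class, tau):
    """Label the query with the class whose closest point has the smallest
    biased distance d_E(z, z_j) - tau_j * d_E(z, z_j)."""
    best_label, best_delta = None, math.inf
    for label, points in points_by_class.items():
        d = min(math.dist(query, p) for p in points)  # closest point of this class
        delta = d - tau[label] * d
        if delta < best_delta:
            best_label, best_delta = label, delta
    return best_label

data = {"x": [(2.0, 0.0)], "y": [(1.0, 0.0)]}
query = (0.0, 0.0)

plain = biased_nearest_class(query, data, {"x": 0.0, "y": 0.0})   # nearest is "y"
biased = biased_nearest_class(query, data, {"x": 0.6, "y": 0.0})  # bias favors "x"
```

Since the biased distance equals (1 - tau_j) times the Euclidean distance, a
larger tau_j shrinks the distances to class j and enlarges its decision
region.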

C4.5 is an algorithm introduced by Quinlan for inducing classification
models, also called decision trees, from data [31]. By selecting
attributes according to the gain ratio criterion, an information measure of
homogeneity, C4.5 builds up a decision tree where each path from the root
to a leaf represents a specific classification rule. We adapt C4.5 to learn
from imbalanced datasets using a method similar to that of [26], i.e. by
changing the prior probability to bias the classification.


5.4.2.2 Evaluations on the Recidivism Dataset 

The recidivism dataset was obtained from a cohort of releases of the North
Carolina prison system during the period from July 1, 1977 to June
30, 1978. In total, there are 4,618 individuals in this dataset, including a
training set with 1,540 individuals and a test set with 3,078 individuals. In
the training set, 570 (37.0%) individuals were recidivists and 970 (63.0%)
were not. In the test set, 1,151 individuals were recidivists and 1,927 were
not. Although this dataset is not skewed as severely as other reported
datasets, for example, the fog dataset [28] and the rooftop dataset used in
the next subsection, it is sufficient for evaluating the performance of
imbalanced learning [26].

We use the same processing method as [32] to select and scale the nine attributes
that appear in Table 5.2, while six other attributes are dropped because they
are not statistically significant at the 5% level.

We compare the performance of our proposed Biased Minimax Probability
Machine model, in both the linear (BMPML) and the Gaussian kernel
setting (BMPMG), with the Naive Bayesian classifier, C4.5 and the k-Nearest
Neighbor method. These methods are adapted to imbalanced learning
according to the methods introduced in the previous section. We run the k-NN
method for k = 1, 3, 5, ..., 21, but we only present the best three results
for brevity. The width parameter for the Gaussian kernel is tuned via cross
validation [13].

Table 5.2. Attribute description in the recidivism dataset

| Attribute | Description |
|---|---|
| TSERVED | Time served (in months) |
| AGE | Age (in months) at the time of release |
| PRIORS | Number of previous incarcerations |
| WHITE | Is the individual Caucasian? |
| FELON | Was the sentence for a felony? |
| LCHY | Does individual's record indicate a serious problem with alcohol? |
| JUNKY | Does individual's record indicate a serious problem with hard drugs? |
| PROPTY | Was individual's sentence for a crime against property? |
| MALE | Is the individual male? |

We first present the experimental results based on the MS criterion in
Table 5.3. For ease of comparison, we show the average of the accuracies of
the two classes at the point where each classifier attains the maximum sum.
BMPML achieves an average accuracy of 0.6391 and BMPMG achieves an
average accuracy of 0.6490, while the highest average accuracy among the other
classifiers is 0.6272, given by NB. Therefore, on this dataset, BMPML and
BMPMG outperform the other methods in terms of the MS criterion.
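As a small illustration of how the reported average is obtained, the sketch below picks, among several operating points of one classifier, the point with the maximum sum of the two class accuracies and then averages them. The operating points are hypothetical and the helper name `max_sum_point` is not from the text.

```python
def max_sum_point(operating_points):
    """Return the (TNR, TPR) pair with the largest sum and its average accuracy."""
    best = max(operating_points, key=lambda p: p[0] + p[1])
    return best, (best[0] + best[1]) / 2.0

# Hypothetical operating points (TNR, TPR) obtained at different thresholds.
points = [(0.90, 0.20), (0.70, 0.58), (0.40, 0.75), (0.10, 0.95)]
best_point, average_accuracy = max_sum_point(points)
```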


Table 5.3. Performance on a recidivism prediction task based on the MS
criterion

| Method | True negative rate | True positive rate | (True positive rate + true negative rate)/2 |
|---|---|---|---|
| NB | 0.6177 | 0.6377 | 0.6272 |
| k-NN(9) | 0.6255 | 0.5464 | 0.5860 |
| k-NN(11) | 0.6238 | 0.5542 | 0.5890 |
| k-NN(13) | 0.5569 | 0.6201 | 0.5885 |
| C4.5 | 0.7405 | 0.4900 | 0.6153 |
| BMPML | 0.7037 | 0.5745 | 0.6391 |
| BMPMG | 0.7203 | 0.5778 | 0.6490 |


Let us next present the experimental results based on the ROC analysis.
By setting the thresholds or costs by trial for NB, k-NN, and C4.5, we
generate ROC curves whose points are distributed as evenly as possible along
their length. As discussed in [26], although this generation method may
increase the running time for some methods, e.g. k-NN, it works well for C4.5
and NB and is sufficient to evaluate the performance of imbalanced learning.
For the BMPM model, since the lower bound $\beta_0$ serves as an accuracy
indicator, we simply vary it from 0 to 1 to generate the corresponding ROC
curve. The ROC curves are shown in Fig. 5.3(a). As seen in this figure, the
performances of BMPML and BMPMG are once again superior to those of the
other methods, since their ROC curves cover those of the other models in most
parts. To quantitatively demonstrate the difference, in Table 5.4 we also show
the areas beneath the ROC curves, approximated by using the trapezoid rule.
BMPML and BMPMG show a consistent superiority to NB, which is the
best of the other three methods.

[Figure 5.3: (a) The full range of ROC curves for recidivism; (b) A critical proportion of ROC curves for recidivism]

Fig. 5.3. ROC curves for the recidivism dataset. Subfigure (a)
shows a full range of the ROC curve, while (b) shows a critical
proportion of the ROC curve, which is of more interest in real
applications. Both figures demonstrate the superiority of the BMPM
model, since the curves of BMPML and BMPMG cover those of
other models in most parts and thus have a larger area
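The trapezoid-rule approximation of the area under a ROC curve can be sketched as follows. The function name `roc_area` and the sample points are illustrative assumptions, not the authors' evaluation code.

```python
def roc_area(points):
    """Area under a ROC curve given (FPR, TPR) points, via the trapezoid rule."""
    pts = sorted(points)  # order the points by false positive rate
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # area of one trapezoid
    return area

# Hypothetical ROC points, including the end points (0, 0) and (1, 1).
curve_area = roc_area([(0.0, 0.0), (0.5, 0.8), (1.0, 1.0)])
chance_area = roc_area([(0.0, 0.0), (1.0, 1.0)])  # the diagonal of a random guesser
```

Restricting the input to points inside a chosen window of false positive rates gives the area of a critical portion of the curve only.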

Table 5.4. Performance on a recidivism prediction task
based on the area of ROC curve

| Method | Area under ROC curve |
|---|---|
| NB | 0.6646 |
| k-NN(11) | 0.6155 |
| k-NN(13) | 0.6189 |
| k-NN(17) | 0.6148 |
| C4.5 | 0.6383 |
| BMPML | 0.6842 |
| BMPMG | 0.6798 |

In addition, in real applications not all portions of the ROC curve are
of great interest [27]. Usually, those with a small false positive rate and a high
true positive rate are of more interest and importance [36]. We thus
show separately the portion of the ROC curve in which the false
positive rate $FP \in [0, 0.5]$ and the true positive rate $TP \in [0.5, 1]$. As shown
in Fig. 5.3(b), in this range the superiority of BMPML and BMPMG is
more obvious than in the full-range ROC analysis. This again demonstrates
our model's advantages over the other methods.

5.4.2.3 Evaluations on the Rooftop Dataset 

The rooftop dataset consists of 17,829 overhead images of Fort Hood, Texas,
a military base, collected as part of the RADIUS project [7].
Depending on whether they are buildings (with a detected rooftop) or not,
781 images in this dataset are labeled as positive examples while 17,048
images are labeled as negative examples. This is clearly
a severely skewed dataset. According to [7, 26], these images were taken
from two different viewpoints, i.e. a nadir aspect and an oblique aspect, and
covered three different areas. Following [21, 26], we represent each of these
images by nine continuous attributes, which are extracted by various
image analyses. Detailed information about this dataset is summarized
in Tables 5.5 and 5.6.


Table 5.5. Description of images in the rooftop dataset

| Sub-dataset | Location | Image size | Aspect | #Positive | #Negative |
|---|---|---|---|---|---|
| 1 | A | 2055 × 375 | Nadir | 71 | 2645 |
| 2 | A | 1803 × 429 | Oblique | 74 | 3349 |
| 3 | B | 670 × 645 | Nadir | 197 | 982 |
| 4 | B | 704 × 568 | Oblique | 238 | 1955 |
| 5 | C | 1322 × 642 | Nadir | 87 | 3722 |
| 6 | C | 1534 × 705 | Oblique | 114 | 4395 |
Table 5.6. Description of the attributes in the rooftop dataset

| Attribute | Description |
|---|---|
| 1 | Evaluation of the edge support |
| 2 | Evaluation of the corner support |
| 3 | Evaluation of the parallel support |
| 4 | Evaluation of the OTV (Orthogonal Trihedral Vertex) support |
| 5 | Evaluation of the shadow corner support |
| 6 | Evaluation of gap overlap |
| 7 | Evaluation of displacement of edge support |
| 8 | Evaluation of crossing lines on any side of the hypothesis |
| 9 | Evaluation of existence of T-junction or L-junction on any side |


We randomly split the rooftop data into a training set with 60% of the data and
a test set with 40% of the data. We then construct classifiers from imbalanced data
based on the training set and perform evaluations on the test set.
We repeat this procedure ten times and use the average of the results as the
performance metric. In this setup, we compare our BMPM with three other
approaches, i.e. NB, C4.5 and k-NN. As in the case of the recidivism
dataset, NB, C4.5 and k-NN are modified to handle imbalanced data. The
width parameter $\sigma$ is again chosen by cross validation. Moreover, we
again run k-NN with k = 1, 3, 5, ..., 21 and present the best three for brevity.
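The repeated random-split protocol described above can be sketched as follows. The helper `repeated_holdout`, the toy dataset, and the `positive_rate` scoring function are hypothetical stand-ins; in the actual experiments the scoring step would train a classifier on the training part and measure it on the test part.

```python
import random
import statistics

def repeated_holdout(data, train_fraction, n_repeats, evaluate, seed=0):
    """Average evaluate(train, test) over several random splits of the data."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        shuffled = data[:]          # copy so the caller's list is left untouched
        rng.shuffle(shuffled)
        cut = int(round(train_fraction * len(shuffled)))
        scores.append(evaluate(shuffled[:cut], shuffled[cut:]))
    return statistics.mean(scores), statistics.stdev(scores)

split_sizes = []

def positive_rate(train, test):
    """Stand-in metric: fraction of positive labels in the test part."""
    split_sizes.append((len(train), len(test)))
    return sum(label for _, label in test) / len(test)

# Toy imbalanced dataset: 10 positives among 100 examples.
toy_data = [(i, 1 if i % 10 == 0 else 0) for i in range(100)]
mean_score, std_score = repeated_holdout(toy_data, 0.6, 10, positive_rate)
```

Reporting the mean together with the standard deviation over the ten repetitions mirrors the "value ± deviation" entries in the result tables of this section.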

The results are summarized in Table 5.7 based on the MS criterion, and in
Fig. 5.4 and Table 5.8 based on the ROC analysis. For
both criteria, the BMPM method demonstrates its superiority to the other
methods, since it has higher sums of the accuracies and larger areas under the
ROC curves. As with the recidivism dataset, we also plot the
more critical portion of the ROC curve in Fig. 5.4(b), where the predominance of
BMPML and BMPMG is even more obvious. To evaluate the performance
more reliably, we perform a significance test based on both LabMRMC [5, 24]
and a t-test. The analysis shows that the accuracies of BMPML and BMPMG
are significantly different from those of the other methods at P < 0.05, both in
terms of the MS criterion and the ROC curve criterion.

Table 5.7. Performance on the rooftop dataset based on the MS criterion

| Method | True negative rate | True positive rate | (True positive rate + true negative rate)/2 |
|---|---|---|---|
| BMPML | 0.8015 ± 0.0058 | 0.8231 ± 0.0063 | 0.8123 ± 0.0060 |
| BMPMG | 0.7997 ± 0.0087 | 0.8405 ± 0.0100 | 0.8201 ± 0.0091 |
| k-NN(7) | 0.7510 ± 0.0055 | 0.8069 ± 0.0062 | 0.7789 ± 0.0052 |
| k-NN(13) | 0.7409 ± 0.0051 | 0.8140 ± 0.0083 | 0.7774 ± 0.0061 |
| k-NN(15) | 0.7433 ± 0.0067 | 0.8211 ± 0.0072 | 0.7822 ± 0.0072 |
| NB | 0.7969 ± 0.0043 | 0.8177 ± 0.0080 | 0.8073 ± 0.0066 |
| C4.5 | 0.8176 ± 0.0040 | 0.7942 ± 0.0063 | 0.8059 ± 0.0051 |



[Figure 5.4: (a) The full range of ROC curves for rooftop; (b) A critical proportion of ROC curves for rooftop; horizontal axis: false positive rate]

Fig. 5.4. ROC curves for the rooftop dataset. We ran each method by
randomly partitioning the dataset into a training dataset (60%) and a test
dataset (40%). The evaluations were iterated 10 times. We then average
the true positive rate and false positive rate to generate the ROC curves.
Subfigure (a) shows a full range of the ROC curve, while (b) shows a critical
proportion of the ROC curve, which is of more interest in real applications.
Both figures demonstrate the superiority of the BMPML and BMPMG
models to other models, since the curves of BMPML and BMPMG cover
those of other models in most parts and thus have a larger area
Table 5.8. Performance on the rooftop dataset based on
the area of ROC curve

| Method | Area under ROC curve |
|---|---|
| BMPML | 0.8791 ± 0.0061 |
| BMPMG | 0.8819 ± 0.0087 |
| k-NN(9) | 0.8601 ± 0.0091 |
| k-NN(11) | 0.8569 ± 0.0058 |
| k-NN(15) | 0.8582 ± 0.0063 |
| NB | 0.8678 ± 0.0060 |
| C4.5 | 0.8744 ± 0.0062 |


5.4.3 Evaluations on Disease Datasets 

Diagnosing diseases shares a key characteristic with imbalanced
learning, since one class, usually the disease class, needs to be given more bias
than the other class. Therefore, the model modifications discussed above
are directly applicable to this kind of task. In the following, we evaluate
the performance of BMPM on two disease datasets, namely, the
breast-cancer dataset and the heart-disease dataset, which are obtained from the UCI
machine learning repository. In the context of diagnosing diseases, the true
positive rate is usually called sensitivity, while the true negative rate is called
specificity. Therefore, we should maximize the sensitivity while keeping
the specificity acceptable. In the following, we present the experimental
results, again compared with the other three methods, namely the modified Naive Bayesian
classifier, k-NN, and C4.5. We randomly split the data for each dataset into a
training set with 80% of the data and a test set with 20% of the data. We then construct
classifiers based on the training set and perform evaluations on the test
set. We repeat this procedure ten times and use the average of the results
as the performance metric.

We present the results based on the MS criterion in Table 5.9 for the
breast-cancer dataset and Table 5.10 for the heart disease dataset. As observed
from these two tables, the BMPM model again demonstrates a superiority to the
other three models. In addition, the t-test also shows that the accuracies of
BMPML and BMPMG are significantly different from those of the other three
classifiers at P < 0.05.

Table 5.9. Comparison of the model performance based on the
MS criterion on the breast-cancer dataset

| Method | Specificity | Sensitivity | (Specificity + Sensitivity)/2 |
|---|---|---|---|
| BMPML | 0.9684 ± 0.0029 | 0.9872 ± 0.0015 | 0.9778 ± 0.0021 |
| BMPMG | 0.9612 ± 0.0018 | 0.9915 ± 0.0011 | 0.9764 ± 0.0016 |
| k-NN(11) | 0.9900 ± 0.0047 | 0.9620 ± 0.0034 | 0.9760 ± 0.0029 |
| k-NN(17) | 0.9862 ± 0.0081 | 0.9664 ± 0.0058 | 0.9762 ± 0.0050 |
| k-NN(7) | 0.9721 ± 0.0071 | 0.9752 ± 0.0049 | 0.9737 ± 0.0058 |
| NB | 0.9366 ± 0.0059 | 0.9719 ± 0.0049 | 0.9543 ± 0.0051 |
| C4.5 | 0.9378 ± 0.0074 | 0.9582 ± 0.0067 | 0.9480 ± 0.0072 |

Table 5.10. Comparison of the model performance based on the
MS criterion on the heart disease dataset

| Method | Specificity | Sensitivity | (Specificity + Sensitivity)/2 |
|---|---|---|---|
| BMPML | 0.8549 ± 0.0042 | 0.8158 ± 0.0013 | 0.8354 ± 0.0035 |
| BMPMG | 0.8403 ± 0.0053 | 0.8572 ± 0.0017 | 0.8488 ± 0.0026 |
| k-NN(17) | 0.7654 ± 0.0029 | 0.8837 ± 0.0018 | 0.8246 ± 0.0027 |
| k-NN(7) | 0.7754 ± 0.0038 | 0.8844 ± 0.0042 | 0.8299 ± 0.0037 |
| k-NN(15) | 0.7512 ± 0.0028 | 0.8653 ± 0.0037 | 0.8082 ± 0.0036 |
| NB | 0.7862 ± 0.0052 | 0.8024 ± 0.0031 | 0.7943 ± 0.0040 |
| C4.5 | 0.8831 ± 0.0022 | 0.7065 ± 0.0018 | 0.7948 ± 0.0021 |

We next present the experimental results based on the ROC analysis
in Fig. 5.5(a) and Fig. 5.6(a). It is observed that BMPML and BMPMG
perform better than the other classifiers for both datasets, since in most parts
the BMPM curves dominate those of the other methods. More specifically, we
calculate the areas under the ROC curves, as listed in Table 5.11, based
on the trapezoid rule. For the breast-cancer dataset, the BMPM produces a curve with
an area of 0.9953 in the linear setting and a curve with an area of 0.9963 in
the Gaussian kernel setting, whereas k-NN with k = 11 forms a curve with a
smaller area equal to 0.9908, the best result among k-NN, NB and C4.5. For
the heart disease dataset, the BMPM shows a curve with an area of 0.8814
in the linear setting and a curve with an area of 0.8932 in the Gaussian kernel
setting. These two areas are both greater than those of the other methods,
i.e. the k-NN classifier, NB and C4.5. In summary, the evaluations based on
the area of the ROC curve quantitatively demonstrate the superiority of our
BMPM model for both datasets.

In addition, as illustrated in Fig. 5.5(b) and Fig. 5.6(b), we show the 
critical portion of Fig. 5.5(a) and Fig. 5.6(a) respectively when the false 
positive rate is in the range of 0.0 to 0.5 and the true positive rate is in 
the range of 0.5 to 1.0. In this critical region, most parts of the ROC curves 
of BMPM cover the corresponding curves of other models in both datasets, 
which again demonstrates the superiority of the BMPM model. 
Table 5.11. Comparison of the model performance based
on the ROC analysis (area under ROC curve)

| Method | Breast-cancer | Heart |
|---|---|---|
| BMPML | 0.9953 ± 0.0018 | 0.8814 ± 0.0056 |
| BMPMG | 0.9963 ± 0.0016 | 0.8932 ± 0.0043 |
| k-NN(11) | 0.9908 ± 0.0060 | 0.8701 ± 0.0038 |
| k-NN(17) | 0.9902 ± 0.0100 | 0.8689 ± 0.0050 |
| k-NN(7) | 0.9887 ± 0.0080 | 0.8596 ± 0.0038 |
| NB | 0.9841 ± 0.0060 | 0.8162 ± 0.0034 |
| C4.5 | 0.9762 ± 0.0120 | 0.8301 ± 0.0038 |




[Figure 5.5: panel (b) A critical proportion of ROC curves for breast-cancer; horizontal axis: false positive rate]

Fig. 5.5. ROC curves for the breast-cancer dataset. The ROC
curves of BMPML and BMPMG dominate those of other models
and BMPMG yields the largest area under the ROC curve
[Figure 5.6: (a) The full range of ROC curves for heart disease; (b) A critical proportion of ROC curves for heart disease]

Fig. 5.6. ROC curves for the heart disease dataset. The ROC
curves of BMPML and BMPMG dominate those of other models
and BMPMG yields the largest area under the ROC curve


5.5 When the Cost for Each Class Is Known 

There exist cases in which the cost for each class can be given by experts.
In the following, we show that the BMPM model can naturally be adapted
to this type of task.

Assuming x and y are the minority class and the majority class respectively,
it is easily verified that minimizing the optimization function given by
Eq.(5.4) is equivalent to maximizing the following formulation:

$$\max \; (r_x K_x + r_y K_y) ,$$

where $r_x$ is the true positive rate or the accuracy of the class x, $r_y$ is the true
negative rate or the accuracy of the class y, $K_x$ and $K_y$ are two constants
which are equal to Cm N y and Cp n N x respectively (N x , N y are respectively 
the number of data points labeled as the classes x and y). Similar to the 















optimization procedure of MS, we can naturally modify the BMPM model in 
the following formulation: 


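$$\max_{\alpha,\, \beta,\, b,\, w \neq 0} \; (K_x \alpha + K_y \beta) ,$$

$$\text{s.t.} \quad \inf_{x \sim (\bar{x}, \Sigma_x)} \Pr\{w^T x \geq b\} \geq \alpha ,$$

$$\inf_{y \sim (\bar{y}, \Sigma_y)} \Pr\{w^T y \leq b\} \geq \beta .$$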

The above optimization derives the classification boundary by maximizing 
the weighted lower bound of the real accuracies or the weighted worst-case 
real accuracies so as to minimize the overall classification risk. Moreover, 
similar to the MS case, it is easily validated that this optimization problem 
can be cast as a sequential BMPM problem. Hence, it can similarly be solved 
based on the method presented in Chapter 3. 


5.6 Summary 

In this chapter, we have applied a novel model named Biased Minimax Probability
Machine to deal with the task of learning from imbalanced datasets.
Given reliable estimates of the mean and covariance of the data, this model
constructs the classification boundary by directly controlling the lower bound of
the real accuracy and thus provides a systematic and rigorous treatment
of skewed data. We have evaluated the BMPM model on two real world
imbalanced datasets and two disease datasets in terms of two criteria. Under
both criteria, its performance is shown to be the best when compared
with other competitive methods such as the Naive Bayesian classifier, the
k-Nearest Neighbor method, and the decision tree classifier, C4.5.
Extension II: A Regression Model from $M^4$


In this chapter, we present a novel regression model which is directly motivated
by the Maxi-Min Margin Machine ($M^4$) model described in Chapter 4.
Regression is one of the problems in supervised learning. The objective is to
learn a model from a given dataset, $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, and then, based
on the learned model, to make accurate predictions of y for future values of x.
Support Vector Regression (SVR), a successful method for this
problem, has good generalization ability [20, 17, 8, 6]. The standard
SVR adopts the $\ell_2$-norm to control the functional complexity and chooses an
$\epsilon$-insensitive loss function with a fixed tube (margin) to measure the empirical
risk. By introducing the $\ell_2$-norm, the optimization problem in SVR can
be transformed to a quadratic programming problem. On the other hand, the
$\epsilon$-tube has the ability to tolerate noise in data, and fixing the tube enjoys the
advantage of simplicity. These settings are global in nature and are effective
in common applications, but they lack the ability and the flexibility to
capture the local trend in some applications. For example, in stock markets,
the data are highly volatile and the associated variance of noise varies over
time. In such cases, fixing the tube cannot capture the local trend of data
and cannot tolerate the noise adaptively.

One typical illustration can be seen in Fig. 6.1. In this figure, the data
contain larger noise as the x value of the data becomes larger. However, the
SVR cannot handle this flexibly and suitably. As shown in Fig. 6.1(a), with a
fixed $\epsilon$-margin (set to 0.04) SVR considers the data globally and equally: the
derived approximating function in SVR deviates from the actual data trend.
On the other hand, as illustrated in Fig. 6.1(b), if we adequately consider
the local volatility of data by adaptively and automatically setting a small
margin in low-volatility regions and a larger margin in high-volatility areas, the
resulting approximating function (the solid line in Fig. 6.1(b)) would be more
suitable and reasonable.
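The difference between a fixed and a locally adapted tube can be illustrated with the $\epsilon$-insensitive loss itself. In the sketch below, the fixed half-width 0.04 is taken from the text, while the wider local margin of 0.12 and the sample values are hypothetical.

```python
def eps_insensitive_loss(y, prediction, margin):
    """Zero inside the tube of half-width `margin`, linear outside it."""
    return max(0.0, abs(y - prediction) - margin)

# A noisy observation in a high-volatility region, 0.10 away from the fit.
y_observed, y_fitted = 1.10, 1.00

loss_fixed_tube = eps_insensitive_loss(y_observed, y_fitted, margin=0.04)
loss_wide_tube = eps_insensitive_loss(y_observed, y_fitted, margin=0.12)
```

With the fixed tube the noisy point is penalised and pulls the fit towards itself; with the wider local tube it incurs no loss, which is the behaviour the non-fixed margin setting aims for.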

To solve these problems, we propose the Local Support Vector
Regression (LSVR) model. We will show that, by taking the local
data trend into account, our model provides a systematic and automatic scheme to locally
and flexibly adapt the margin. Moreover, we will also demonstrate that this
novel LSVR model contains special cases with a physical
meaning very similar to that of the standard SVR. Another critical feature of our model is that
the associated optimization of LSVR can be cast as a Second Order Cone
Programming (SOCP) problem, which can be efficiently solved in polynomial
time [11]. The margin setting in the novel LSVR model is different from that
in our previous work [21]. Concretely, the tube here is adapted directly based
on the functional complexity and the local trend of data. This provides
a more systematic and more rigorous way to moderate the margin automatically.
This model can be seen as an extension of $M^4$ to regression.
In $M^4$, the main purpose is to build a classification boundary for different
classes, while in LSVR the goal is to model a function approximating the
data. Therefore, $M^4$ considers different data trends for different classes, while
LSVR focuses on employing different data trends in different data regions.
This is more valuable within the framework of regression tasks.

[Figure 6.1: panel (b) A more reasonable line with non-fixed margins]

Fig. 6.1. Illustration of the $\epsilon$-insensitive loss function with fixed and
non-fixed margins in the feature space. In (b), a non-fixed margin setting is more
reasonable. It can moderate the effect of the noise by enlarging (shrinking)
the margin width in the local area with large (small) variance of noise

The rest of this chapter is organized as follows: the linear LSVR model
with its theoretical background is presented in Section 6.1. In Section 6.2, we
demonstrate how the standard SVR can be considered as a special case of
our proposed model. In Section 6.3, we show the link between our proposed
LSVR model and the general large margin classifier $M^4$. The optimization
method is described in Section 6.4. The kernelized LSVR
is tackled by utilizing Mercer's kernel in Section 6.5. Section 6.6 provides
an additional interpretation on the issue of controlling the complexity of the
LSVR model. Section 6.7 presents the experiments on both synthetic and
real data. The chapter is concluded in Section 6.8.
6.1 A Local Support Vector Regression Model

In this section, we first present the problem and model definition of the LSVR 
model. We then detail its interpretation and its appealing characteristics. 
After that we state its corresponding optimization method. 

6.1.1 Problem and Model Definition 

A basic idea to avoid overfitting in function approximation is to restrict the
class of admissible solutions by a regularization term. A common method
is to find a function, $f: \mathbb{R}^d \to \mathbb{R}$, based on an N-instance dataset
$D = \{(x_i, y_i) \mid x_i \in \mathbb{R}^d,\ y_i \in \mathbb{R},\ i = 1, \ldots, N\}$, by minimizing the following
regularized functional risk:

$$R_{reg}[f] = \Omega[f] + C \cdot R_{emp}[f] ,$$

where C > 0 is a regularization parameter used as the tradeoff between the
minimal empirical risk $R_{emp}[f]$ and the smoothness or functional complexity
controlled by $\Omega[f]$.

Support Vector Regression is a successful regression model following this
idea. It attempts to find an approximating function in the linear form:

$$f(x) = w^T x + b, \quad w, x \in \mathbb{R}^d, \ b \in \mathbb{R}. \tag{6.1}$$

For the complexity term $\Omega[f]$, SVR selects the $\ell_2$-norm or another $\ell_p$-norm of w. To
measure the empirical risk $R_{emp}[f]$, the standard SVR uses an $\epsilon$-insensitive
loss function [20].

In order to improve the flexibility of the standard SVR, we propose a
new regression model, namely Local Support Vector Regression (LSVR). The
objective is to learn the function in Eq.(6.1) approximating the data in D
by making the function locally as smooth (i.e. as little volatile) as possible while keeping the
error as small as possible. We formulate this objective as follows:

$$\min_{w,\, b,\, \xi_i,\, \xi_i^*} \; \frac{1}{N} \sum_{i=1}^{N} \sqrt{w^T \Sigma_i w} + C \sum_{i=1}^{N} (\xi_i + \xi_i^*) , \tag{6.2}$$

$$\text{s.t.} \quad y_i - (w^T x_i + b) \leq \epsilon \sqrt{w^T \Sigma_i w} + \xi_i ,$$

$$(w^T x_i + b) - y_i \leq \epsilon \sqrt{w^T \Sigma_i w} + \xi_i^* , \tag{6.3}$$

$$\xi_i \geq 0, \quad \xi_i^* \geq 0, \quad i = 1, \ldots, N ,$$

where $\xi_i$ and $\xi_i^*$ are the corresponding up-side and down-side errors at the
i-th point, respectively, $\epsilon$ is a positive constant, and $\Sigma_i$ is the covariance matrix
formed by the i-th data point and those data points close to it.
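To make the objective concrete, the following sketch evaluates it for a candidate (w, b) on one-dimensional inputs, where $\sqrt{w^T \Sigma_i w}$ reduces to |w| times the local standard deviation of x. It does not solve the optimization problem. The function names, the truncation of the window at the ends of the series, and the toy data are assumptions made for illustration.

```python
import math

def local_std(xs, i, k):
    """Standard deviation of x over point i and up to k neighbours on each side."""
    window = xs[max(0, i - k):i + k + 1]   # truncated at the ends (an assumption)
    mean = sum(window) / len(window)
    return math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))

def lsvr_objective(xs, ys, w, b, eps, C, k):
    """Value of the LSVR objective for a fixed (w, b) on 1-D inputs."""
    n = len(xs)
    # sqrt(w^T Sigma_i w) in one dimension: |w| * local standard deviation.
    margins = [abs(w) * local_std(xs, i, k) for i in range(n)]
    # Smallest slacks that satisfy the two constraints for each point.
    up = [max(0.0, y - (w * x + b) - eps * m) for x, y, m in zip(xs, ys, margins)]
    down = [max(0.0, (w * x + b) - y - eps * m) for x, y, m in zip(xs, ys, margins)]
    return sum(margins) / n + C * (sum(up) + sum(down))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]   # noise-free line y = 2x + 1

exact_fit = lsvr_objective(xs, ys, w=2.0, b=1.0, eps=0.1, C=10.0, k=1)
shifted_fit = lsvr_objective(xs, ys, w=2.0, b=2.0, eps=0.1, C=10.0, k=1)
```

For the exact fit all slacks vanish, so only the averaged local-volatility term remains; shifting the intercept pushes points outside their tubes and the slack term raises the objective.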
6.1.2 Interpretations and Appealing Properties

In this section, beginning with the physical meaning of the term
$w^T \Sigma_i w$, we interpret our novel LSVR model.

Suppose $y_i = w^T x_i + b$ and $\bar{y}_i = w^T \bar{x}_i + b$, where $\bar{x}_i$ denotes the mean of the
data points in the local region around the i-th data point. We have the variance around
the i-th data point as

$$\Delta_i = \frac{1}{2k+1} \sum_{j=-k}^{k} (y_{i+j} - \bar{y}_i)^2 = \frac{1}{2k+1} \sum_{j=-k}^{k} \left[ w^T (x_{i+j} - \bar{x}_i) \right]^2 = w^T \Sigma_i w ,$$

where 2k is the number of data points closest to the i-th data point. Therefore,
$\Delta_i = w^T \Sigma_i w$ actually captures the volatility in the local region around
the i-th data point. In addition, $\Delta_i$ can also measure the local functional
complexity around the i-th data point, since it reflects the smoothness of the
corresponding local region. This will be addressed in detail later in Section 6.6.
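The identity between the variance of the fitted values in a local window and the quadratic form $w^T \Sigma_i w$ can be checked numerically. The sketch below uses a hypothetical two-dimensional series and assumes that the neighbours of point i are the k points before and after it in the sequence.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def local_covariance(window):
    """Covariance matrix of the points in `window` (normalised by their count)."""
    n, d = len(window), len(window[0])
    mean = [sum(x[a] for x in window) / n for a in range(d)]
    return [[sum((x[a] - mean[a]) * (x[b] - mean[b]) for x in window) / n
             for b in range(d)] for a in range(d)]

def quadratic_form(w, sigma):
    return sum(w[a] * sigma[a][b] * w[b]
               for a in range(len(w)) for b in range(len(w)))

# Hypothetical 2-D series; the window holds point i and k neighbours on each side.
xs = [(0.0, 1.0), (1.0, 0.5), (2.0, 2.5), (3.0, 1.5), (4.0, 4.0)]
w, b, i, k = (0.5, -1.0), 0.3, 2, 1
window = xs[i - k:i + k + 1]

# Variance of the fitted values w^T x + b inside the window.
fitted = [dot(w, x) + b for x in window]
mean_fitted = sum(fitted) / len(fitted)
delta_from_outputs = sum((f - mean_fitted) ** 2 for f in fitted) / len(fitted)

# The same quantity computed as w^T Sigma_i w.
delta_from_covariance = quadratic_form(w, local_covariance(window))
```

The two quantities agree up to floating-point error, and the intercept b drops out because it shifts every fitted value and their mean by the same amount.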

By using the first meaning of $\Delta_i = w^T \Sigma_i w$ (representing the local
volatility), LSVR can systematically and automatically vary the tube: if the i-th
data point lies in an area with a larger variance of noise, it will contribute to
a larger $\epsilon \sqrt{w^T \Sigma_i w}$, i.e. a larger local margin. This reduces the
impact of the noise around the point. On the other hand, when the
i-th data point is in a region with a smaller variance of noise, the local margin
(tube), $\epsilon \sqrt{w^T \Sigma_i w}$, will be smaller. Therefore, the corresponding point
contributes more to the fitting process. In comparison, the standard
SVR adopts a fixed margin, which treats each point equally and therefore
lacks the ability to tolerate changes in noise.

By engaging the second compelling property of $\Delta_i = w^T \Sigma_i w$, namely,
a measure of the local functional complexity, LSVR controls the
overall smoothness of the approximating function by minimizing the average
of $\sqrt{\Delta_i}$, as seen in Eq.(6.2). Intuitively, the margin around each point can be
neither too large nor too small: if the margin is too large, the local data
trend may not be captured because the data are "over-tolerated"; if the margin is too
small, the local data trend may be "over-emphasized", resulting in a highly
zig-zag approximating curve. Therefore, by adding the regularization term, a
trade-off can be achieved via adapting the parameter C.


6.2 Connection with Support Vector Regression 

We now analyze the connection of the LSVR model with the standard Support
Vector Regression model. By considering the data trend globally and
equally, i.e. setting $\Sigma_i = \Sigma$, for $i = 1, \ldots, N$, we can transform the
optimization of Eq.(6.2) as follows:

$$\min_{w,\, b,\, \xi_i,\, \xi_i^*} \; \sqrt{w^T \Sigma w} + C \sum_{i=1}^{N} (\xi_i + \xi_i^*) ,$$

$$\text{s.t.} \quad y_i - (w^T x_i + b) \leq \epsilon \sqrt{w^T \Sigma w} + \xi_i ,$$

$$(w^T x_i + b) - y_i \leq \epsilon \sqrt{w^T \Sigma w} + \xi_i^* ,$$

$$\xi_i \geq 0, \quad \xi_i^* \geq 0, \quad i = 1, \ldots, N .$$


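Further, if $\Sigma = I$, we obtain:

$$\min_{w,\, b,\, \xi_i,\, \xi_i^*} \; \|w\| + C \sum_{i=1}^{N} (\xi_i + \xi_i^*) , \tag{6.4}$$

$$\text{s.t.} \quad y_i - (w^T x_i + b) \leq \epsilon \|w\| + \xi_i , \tag{6.5}$$

$$(w^T x_i + b) - y_i \leq \epsilon \|w\| + \xi_i^* , \tag{6.6}$$

$$\xi_i \geq 0, \quad \xi_i^* \geq 0, \quad i = 1, \ldots, N .$$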

The above optimization problem is very similar to the $\ell_1$-norm SVR, except
that it has a margin related to the complexity term. In the following, we will
prove that the above optimization is actually equivalent to the $\ell_1$-norm SVR
in a meaningful sense.


Lemma 6.1. The LSVR model with setting $\Sigma_i = I$ is equivalent to the
$\ell_1$-norm SVR in the sense that: (1) Assuming a unique $\epsilon_1^*$ exists for making the
$\ell_1$-norm SVR optimal (i.e. setting $\epsilon$ to $\epsilon_1^*$ will make the objective function
minimal), if for $\epsilon_1^*$ the $\ell_1$-norm SVR achieves a solution $\{w_1^*, b_1^*\} = SVR(\epsilon_1^*)$, then
the LSVR can produce the same solution by setting the parameter $\epsilon = \epsilon_1^* / \|w_1^*\|$,
i.e. $LSVR(\epsilon_1^* / \|w_1^*\|) = SVR(\epsilon_1^*)$; (2) Assuming a unique $\epsilon_2^*$ exists for making
the special case of LSVR optimal (i.e. setting $\epsilon$ to $\epsilon_2^*$ will make the objective
function minimal), if for $\epsilon_2^*$ the special case of LSVR achieves a solution
$\{w_2^*, b_2^*\} = LSVR(\epsilon_2^*)$, then the $\ell_1$-norm SVR can produce the same solution
by setting the parameter $\epsilon = \epsilon_2^* \|w_2^*\|$, i.e. $SVR(\epsilon_2^* \|w_2^*\|) = LSVR(\epsilon_2^*)$.


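Proof. Since (1) and (2) are very similar statements, we only prove (1).
When $\epsilon$ of the special case of LSVR is set to $\epsilon_1^* / \|w_1^*\|$, the value of the
objective function of LSVR will be no larger than the one obtained by simply setting
$\{w, b\} = \{w_1^*, b_1^*\}$, since $\{w_1^*, b_1^*\}$ is easily verified to satisfy the constraints
of LSVR. Namely,

$$LSVR\left( \frac{\epsilon_1^*}{\|w_1^*\|} \right) \succeq SVR(\epsilon_1^*) , \tag{6.7}$$

where we use $\succeq$ to represent "superior to". We denote the solution for
$\epsilon = \epsilon_1^* / \|w_1^*\|$ in LSVR as $\{w_2, b_2\}$. Similarly, by setting $\epsilon = \epsilon_1^* \|w_2\| / \|w_1^*\|$ in SVR, we have:

$$SVR\left( \frac{\epsilon_1^* \|w_2\|}{\|w_1^*\|} \right) \succeq LSVR\left( \frac{\epsilon_1^*}{\|w_1^*\|} \right) . \tag{6.8}$$

Combining Eqs.(6.7) and (6.8), we have:

$$SVR\left( \frac{\epsilon_1^* \|w_2\|}{\|w_1^*\|} \right) \succeq LSVR\left( \frac{\epsilon_1^*}{\|w_1^*\|} \right) \succeq SVR(\epsilon_1^*) . \tag{6.9}$$

Since $\epsilon_1^*$ is the unique $\epsilon$ making the objective of SVR minimal, Eq.(6.9) implies
that $w_2 = w_1^*$.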

In addition, if in LSVR we use the term $w^T \Sigma w$ instead of its square
root as the structural risk or complexity term, a similar proof
shows that the $\ell_2$-norm SVR is equivalent to the special case of LSVR
with $\Sigma_i = \Sigma$. In summary, we can see that the LSVR model actually contains
the standard SVR model as a special case.


6.3 Link with Maxi-Min Margin Machine 


The LSVR model can also be considered as an extension of the general large
margin classifier, the Maxi-Min Margin Machine ($M^4$), presented previously in
this book and in [10]. Within the framework of binary classification for classes x
and y, the $M^4$ is formulated as follows:

$$\max_{\rho,\, w \neq 0,\, b} \; \rho , \tag{6.10}$$

$$\text{s.t.} \quad \frac{w^T x_i + b}{\sqrt{w^T \Sigma_x w}} \geq \rho , \quad i = 1, 2, \ldots, N_x , \tag{6.11}$$

$$\frac{-(w^T y_j + b)}{\sqrt{w^T \Sigma_y w}} \geq \rho , \quad j = 1, 2, \ldots, N_y , \tag{6.12}$$

where $\Sigma_x$ and $\Sigma_y$ refer to the covariance matrices of the x and the y data,
respectively.

Within the framework of classification, $M^4$ considers different data trends
for different classes. Analogously, in the novel LSVR model we allow different
data trends for different regions, which is more suitable for regression.


6.4 Optimization Method 

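In order to solve the optimization problem of Eq.(6.2), we introduce auxiliary
variables, $t_1, \ldots, t_N$, and transform the problem as follows:

$$\min_{w,\, b,\, t_i,\, \xi_i,\, \xi_i^*} \; \frac{1}{N} \sum_{i=1}^{N} t_i + C \sum_{i=1}^{N} (\xi_i + \xi_i^*) , \tag{6.13}$$

$$\text{s.t.} \quad y_i - (w^T x_i + b) \leq \epsilon \sqrt{w^T \Sigma_i w} + \xi_i , \tag{6.14}$$

$$(w^T x_i + b) - y_i \leq \epsilon \sqrt{w^T \Sigma_i w} + \xi_i^* , \tag{6.15}$$

$$\sqrt{w^T \Sigma_i w} \leq t_i ,$$

$$t_i \geq 0, \quad \xi_i \geq 0, \quad \xi_i^* \geq 0, \quad i = 1, \ldots, N .$$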

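It is clear that Eqs.(6.14) and (6.15) are non-convex constraints. This
may present difficulties in optimizing the LSVR problem. In the following,
we relax the optimization to a Second Order Cone Programming (SOCP)
problem [11] by replacing $\sqrt{w^T \Sigma_i w}$ with its upper bound $t_i$:

$$\min_{w,\, b,\, t_i,\, \xi_i,\, \xi_i^*} \; \frac{1}{N} \sum_{i=1}^{N} t_i + C \sum_{i=1}^{N} (\xi_i + \xi_i^*) ,$$

$$\text{s.t.} \quad y_i - (w^T x_i + b) \leq \epsilon t_i + \xi_i ,$$

$$(w^T x_i + b) - y_i \leq \epsilon t_i + \xi_i^* ,$$

$$\sqrt{w^T \Sigma_i w} \leq t_i ,$$

$$t_i \geq 0, \quad \xi_i \geq 0, \quad \xi_i^* \geq 0, \quad i = 1, \ldots, N .$$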


Since ti is closely related to \Jw T SiW, weighting the margin width with 
ti will contain a meaning similar to the original motivation, i.e. adapting 
the margin flexibly. More importantly, the relaxed form is a linear program¬ 
ming problem under quadratic cone constraints, or more specifically it is a 
Second Order Cone Programming. Therefore, this problem can be solved in 
polynomial time by many general optimization packages, e.g. Sedumi [18, 19]. 
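To make the structure of the relaxed problem concrete, the following minimal Python sketch evaluates its objective for a given candidate $(w, b, t)$ in the linear case. It is an illustration only, not a solver: the function names `quad_form` and `relaxed_objective` are hypothetical, the local covariances are assumed to be supplied as nested lists, and the slack variables are simply set to their smallest feasible values.

```python
import math


def quad_form(w, S):
    """Return w^T S w for a vector w and a square matrix S (nested lists)."""
    n = len(w)
    return sum(w[a] * S[a][b] * w[b] for a in range(n) for b in range(n))


def relaxed_objective(w, b, t, X, y, Sigmas, eps, C):
    """Objective (1/N) sum t_i + C sum (xi_i + xi_i^*) of the relaxed problem.

    Returns None when a cone constraint sqrt(w^T Sigma_i w) <= t_i is violated.
    The slacks are chosen as the smallest values satisfying the constraints.
    """
    n = len(X)
    total_slack = 0.0
    for x_i, y_i, t_i, S_i in zip(X, y, t, Sigmas):
        if math.sqrt(quad_form(w, S_i)) > t_i + 1e-12:
            return None  # infeasible: t_i is not an upper bound
        f_i = sum(wj * xj for wj, xj in zip(w, x_i)) + b
        xi = max(0.0, y_i - f_i - eps * t_i)       # slack above the tube
        xi_star = max(0.0, f_i - y_i - eps * t_i)  # slack below the tube
        total_slack += xi + xi_star
    return sum(t) / n + C * total_slack
```

Because a larger $t_i$ widens the tube at point $i$, a solver may choose $t_i$ strictly above $\sqrt{w^T \Sigma_i w}$; this is exactly where the relaxed problem can differ from the original one.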


6.5 Kernelization 


In this section we extend the above linear regression model to the non-linear
one by using Mercer kernels. Suppose the training data are mapped into
a kernel space or a feature space by the mapping function $\varphi: x \mapsto \varphi(x)$.

Then, the objective in the feature space is transformed as follows:

$$\min_{w,\, b,\, t_i,\, \xi_i,\, \xi_i^*} \ \frac{1}{N}\sum_{i=1}^{N} t_i + C \sum_{i=1}^{N} (\xi_i + \xi_i^*) \tag{6.16}$$

$$\text{s.t.} \quad y_i - (w^T \varphi(x_i) + b) \leq \epsilon t_i + \xi_i ,$$

$$(w^T \varphi(x_i) + b) - y_i \leq \epsilon t_i + \xi_i^* ,$$

$$\sqrt{w^T \Sigma_i^{\varphi} w} \leq t_i ,$$

$$t_i \geq 0, \quad \xi_i \geq 0, \quad \xi_i^* \geq 0, \quad i = 1, \ldots, N .$$
In order to utilize Mercer kernels, we first present the following theorem.

Theorem 6.2. If the corresponding local covariance $\Sigma_i^{\varphi}$ can be estimated by
the mapped training data, i.e. $\bar{\varphi}_i$ and $\Sigma_i^{\varphi}$ can be written as

$$\bar{\varphi}_i = \frac{1}{2k+1} \sum_{j=-k}^{k} \varphi(x_{i+j}) , \tag{6.17}$$

$$\Sigma_i^{\varphi} = \frac{1}{2k+1} \sum_{j=-k}^{k} (\varphi(x_{i+j}) - \bar{\varphi}_i)(\varphi(x_{i+j}) - \bar{\varphi}_i)^T , \tag{6.18}$$

where we just consider the $2k$ data points which are the closest to the $i$-th data point,
then the optimal $w$ lies in the span of the mapped training data.

Proof. Suppose $w = w_p + w_\perp$, where $w_p$ is the projection of $w$ onto the span
of the mapped training data and $w_\perp$ is the component orthogonal to the span.
Since $w_\perp^T \varphi(x_i) = 0$, $i = 1, \ldots, N$, it follows that:

$$w^T \varphi(x_i) = w_p^T \varphi(x_i) ,$$

$$w^T \Sigma_i^{\varphi} w = w_p^T \Sigma_i^{\varphi} w_p .$$

Therefore, we can omit $w_\perp$ since it disappears in the optimization. We then
set it to 0 and obtain $w = w_p$, i.e. the optimal $w$ lies in the span of the
mapped training data.

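By using Theorem 6.2, we write $w$ as $\sum_{j=1}^{N} \mu_j \varphi(x_j)$ and substitute it into
Eq.(6.16). By rewriting Eq.(6.16) in the kernel form with a kernel function
$K(z_1, z_2) = \varphi(z_1)^T \varphi(z_2)$, we then obtain:

$$w^T \varphi(x_i) = \sum_{j=1}^{N} \mu_j K(x_i, x_j) = \mu^T K_i ,$$

$$w^T \Sigma_i^{\varphi} w = \mu^T L_i^T L_i \mu ,$$

where $\mu = [\mu_1, \ldots, \mu_N]^T$, $K_i = [K(x_1, x_i) \ \ldots \ K(x_N, x_i)]^T$, $K_{i,j} = K(x_i, x_j)$,

$$L_i = \frac{1}{\sqrt{2k+1}} \left( K_{[i-k:i+k,\,N]} - \mathbf{1}_{2k+1} k_i^T \right), \qquad
K_{[i-k:i+k,\,N]} = \begin{pmatrix} K_{i-k,1} & \cdots & K_{i-k,N} \\ \vdots & & \vdots \\ K_{i+k,1} & \cdots & K_{i+k,N} \end{pmatrix} ,$$

$(k_i)_t = \frac{1}{2k+1} \sum_{j=-k}^{k} K(x_{i+j}, x_t)$, and $\mathbf{1}_{2k+1}$ is a column vector of ones of
dimension $2k + 1$.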

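Consequently, the corresponding objective in Eq.(6.16) becomes:

$$\min_{\mu,\, b,\, t_i,\, \xi_i,\, \xi_i^*} \ \frac{1}{N}\sum_{i=1}^{N} t_i + C \sum_{i=1}^{N} (\xi_i + \xi_i^*)$$

$$\text{s.t.} \quad y_i - (\mu^T K_i + b) \leq \epsilon t_i + \xi_i ,$$

$$(\mu^T K_i + b) - y_i \leq \epsilon t_i + \xi_i^* ,$$

$$\sqrt{\mu^T L_i^T L_i \mu} \leq t_i ,$$

$$t_i \geq 0, \quad \xi_i \geq 0, \quad \xi_i^* \geq 0, \quad i = 1, \ldots, N .$$

Hence we only need a kernel function in the optimization, without knowing a
specific mapping function, and the problem can be solved by SOCP methods.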
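The kernelized local term can be checked numerically. The sketch below is a minimal illustration under stated assumptions: `local_matrix` and `local_term` are hypothetical helper names, the data are assumed to be ordered so that the $2k$ nearest neighbours of point $i$ are the indices $i-k, \ldots, i+k$, and only interior indices ($k \leq i \leq N-k-1$, zero-based) are handled. It builds the rows of the matrix $L_i$ from kernel evaluations and returns $\|L_i \mu\|$, which equals the square root of the local variance of $w^T \varphi(x)$ over the window.

```python
import math


def local_matrix(X, i, k, kernel):
    """Build L_i: one row per window point i-k..i+k, one column per training point.

    Each row is (K_m - k_i)^T / sqrt(2k+1), where K_m holds the kernel values
    between window point m and all training points, and k_i is the window mean.
    """
    n = len(X)
    size = 2 * k + 1
    rows = [[kernel(X[m], X[t]) for t in range(n)] for m in range(i - k, i + k + 1)]
    mean_row = [sum(r[t] for r in rows) / size for t in range(n)]
    scale = 1.0 / math.sqrt(size)
    return [[scale * (r[t] - mean_row[t]) for t in range(n)] for r in rows]


def local_term(L, mu):
    """Return sqrt(mu^T L^T L mu), i.e. the Euclidean norm of L mu."""
    return math.sqrt(sum(sum(l * m for l, m in zip(row, mu)) ** 2 for row in L))
```

With a linear kernel the result can be compared against the explicit value $|w|$ times the standard deviation of the window, which is a convenient sanity check before switching to an RBF kernel.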

6.6 Additional Interpretation on $w^T \Sigma_i w$

We now interpret, in terms of sparse approximation [2, 3, 7, 5, 4, 9, 14], why
$w^T \Sigma_i w$ can be considered as the local complexity around the data point $x_i$.
In [7], Girosi demonstrated an equivalence between sparse approximation
and Support Vector Machines. In the view of sparse approximation,
regression can be regarded as the task of approximating data using linear
superpositions of basis functions selected from a large, redundant set of
basis functions, called a dictionary [12]. A common principle in choosing a good
approximating function is that one should not only approximate the given
data as accurately as possible but, more importantly, use as few
basis functions as possible. Therefore, a sparsity concept is invoked, i.e. the
approximating function should be sparse in its use of the basis functions. In
connection with Support Vector Regression, the reader can regard
a basis function as being associated with each data point (note that the
regression function can be represented as a linear combination in the kernel
space). The fact that SVR has the property of sparsity, i.e. only a small
fraction of the data points (the support vectors) contribute to the final
approximating function, may therefore explain why it has achieved great
success. The measure of sparsity of the approximating function $f$, which is
also regarded as the measure of complexity, is formulated as follows:





(6.19) 


( 6 . 20 ) 


It is well known that the $\ell_0$-norm of a vector counts the number of elements
different from zero. The complexity term can also be described as:




( 6 . 21 ) 
However, because this involves minimizing a combinatorial term,
it is extremely difficult to perform the optimization in practice. Therefore,
one often uses the $\ell_1$-norm as its approximated version instead, i.e.

m = \H\ p tl • ( 6 . 22 ) 

When $p$ is set to 1, it therefore leads to the standard $\ell_1$-norm SVR. When
one looks back on the LSVR model, minimizing $(1/N) \sum_{i=1}^{N} \sqrt{w^T \Sigma_i w}$ presents
another approximated version of the sparsity, since it also tries to make $w$ as
sparse as possible.¹ Another advantage of using $(1/N) \sum_{i=1}^{N} \sqrt{w^T \Sigma_i w}$ is that
it leads to an easy solving method, as illustrated in Section 6.4.


6.7 Experiments 

In this section, we report the experiments on both synthetic sine datasets and
real world datasets. The SOCP problem associated with our LSVR model is
solved by a general software package, SeDuMi [18, 19]. The SVR algorithm is
performed by LIBSVM [1].


6.7.1 Evaluations on Synthetic Sine Data 

Fifty examples $(x_i, y_i)$ are generated from a sine function [16], where $x_i$ are
drawn uniformly from $[-3, 3]$, and $y_i = \sin(\pi x_i)/(\pi x_i) + \tau_i$, with $\tau_i$ drawn
from a Gaussian with zero mean and variance $\sigma^2$. Two cases are evaluated.
One is with $\sigma = 0$. In the other case, the standard deviation of the noise
increases linearly from 0.5 at $x = -3$ to 1.5 at $x = 3$, so that
the variance of the noise is different in different regions. We use
the default parameter $C = 100$ and the RBF kernel $K(u, v) = \exp(-\|u - v\|^2)$.
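A minimal sketch of how such a synthetic dataset could be generated is given below. The function names, the seed handling, and the convention that the value at $x = 0$ is the limit 1 are assumptions made for illustration; only the sampling ranges and the noise schedule are taken from the description above.

```python
import math
import random


def sinc(x):
    """sin(pi x) / (pi x), with the limiting value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)


def noise_std(x, case):
    """Case 1: no noise. Case 2: std grows linearly from 0.5 (x=-3) to 1.5 (x=3)."""
    if case == 1:
        return 0.0
    return 0.5 + (x + 3.0) / 6.0


def make_sine_data(n=50, case=1, seed=0):
    """Draw n inputs uniformly from [-3, 3] and add Gaussian noise to sinc(x)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-3.0, 3.0) for _ in range(n)]
    ys = []
    for x in xs:
        std = noise_std(x, case)
        noise = rng.gauss(0.0, std) if std > 0.0 else 0.0
        ys.append(sinc(x) + noise)
    return xs, ys
```

Calling `make_sine_data(50, case=2, seed=s)` for 100 different seeds `s` would mimic the repeated random trials reported in Table 6.1.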

Table 6.1 reports the average results over 100 random trials with different
ε values. Fig. 6.2 illustrates the difference between the LSVR model and the
SVR algorithm when ε = 0.2. For Case I, σ = 0.0, the LSVR model can
adjust the tube automatically to fit the data with a smaller Mean Square
Error (MSE), which can be seen in Fig. 6.2(c). In contrast, with its fixed
tube, the SVR algorithm lacks this flexibility (see Fig. 6.2(a)). As a result,
the MSE of SVR increases as ε increases. As reported in Table 6.1, when ε ≥ 0.8,
there are no support vectors in SVR and the MSE is the largest. In Case II, the
LSVR model has smaller MSEs and smaller STDs for all values of ε. Fig. 6.2(d) also
shows that the approximating function obtained by LSVR is smoother than
that obtained by SVR.


¹ Intuitively, when $w$ is sparser, $(1/N) \sum_{i=1}^{N} \sqrt{w^T \Sigma_i w}$ would be smaller.
Table 6.1. Experimental results (MSE ± STD) of the LSVR model and the SVR
algorithm on the sine data with different ε values

         Case I: σ = 0.0       Case II: Varying σ
ε        LSVR      SVR         LSVR               SVR
0.0      0         0           0.1825 ± 0.1011    0.3101 ± 0.1165
0.2      0.0004    0.0160      0.2338 ± 0.0888    0.2761 ± 0.1111
0.4      0.0016    0.0722      0.1917 ± 0.0726    0.2217 ± 0.0840
0.6      0.0044    0.1695      0.1540 ± 0.0687    0.2384 ± 0.0867
0.8      0.0082    0.1748      0.1333 ± 0.0674    0.2333 ± 0.1096
1.0      0.0125    0.1748      0.1115 ± 0.0597    0.2552 ± 0.1218
2.0      0.0452    0.1748      0.0959 ± 0.0421    0.2616 ± 0.1517





Fig. 6.2. Experimental results on synthetic sine data with ε = 0.2. Recoverable panel titles: (b) SVR with varying σ; (d) LSVR with varying σ
6.7.2 Evaluations on Real Financial Data

We evaluate our model on financial time series data, which are highly
volatile and non-stationary. The experimental data are three major indices:
(1) the Dow Jones Industrial Average (DJIA), (2) the NASDAQ, and (3) the
Standard & Poor's 500 index (S&P500), in the period from January 2, 2004
to April 30, 2004. We choose this period because the data of the three indices
have different statistical properties, as reported in Table 6.2. In particular,
the data of the three indices in this period differ considerably in
skewness. This diversity in the data helps to avoid biasing the
comparison of the models.


Table 6.2. Summary statistics of normalized returns of DJIA,
NASDAQ and S&P500 in the experiments. These indices show
different statistical properties.

           DJIA                 NASDAQ               S&P500
Moments    Train     Test       Train     Test       Train     Test
Mean       0.0000    -0.2850    -0.0000   -0.4819    0.0000    -0.3858
S.D.       1.0000    0.9957     1.0000    1.1312     1.0000    1.1298
Skew       -0.0678   0.1684     0.0928    0.3256     -0.1298   -0.0102
Kurt       2.5437    2.7706     2.6600    1.8631     2.5308    2.4124


Following the procedure in [15], we convert the daily closing prices ($d_t$)
of these indices to continuously compounded returns ($r_t = \log(d_{t+1}/d_t)$) and
set the ratio of the number of training returns to the number of
test returns to 5 : 1. We normalize these return series
by $R_t = (r_t - \mathrm{Mean}(r_t))/\mathrm{SD}(r_t)$, where the means and standard deviations
are computed for each individual index in the training period.
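The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration: the function names are hypothetical, and the use of the sample standard deviation (dividing by the number of training returns minus one) is an assumption, since the text does not say which estimator was used.

```python
import math


def log_returns(prices):
    """Continuously compounded returns r_t = log(d_{t+1} / d_t)."""
    return [math.log(prices[t + 1] / prices[t]) for t in range(len(prices) - 1)]


def split_and_normalize(returns, train_parts=5, test_parts=1):
    """Split returns 5:1 in time order and normalize with training statistics."""
    n_train = len(returns) * train_parts // (train_parts + test_parts)
    train, test = returns[:n_train], returns[n_train:]
    mean = sum(train) / len(train)
    # Assumption: sample standard deviation of the training returns.
    sd = math.sqrt(sum((r - mean) ** 2 for r in train) / (len(train) - 1))

    def normalize(rs):
        return [(r - mean) / sd for r in rs]

    return normalize(train), normalize(test)
```

Because the test returns are normalized with the training mean and standard deviation, their mean and standard deviation generally differ from 0 and 1, which matches the pattern of the Test columns in Table 6.2.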

We compare the performance of the LSVR model against SVR. The
predicted system is modelled as $R_t = f(x_t)$, where $x_t$ takes the previous four
days' normalized returns as indicators, i.e. $x_t = (R_{t-4}, R_{t-3}, R_{t-2}, R_{t-1})$.
This simple setting is based on the suggestion in [15] that a
suitable number of lagged values is four. We then apply the modelled
function $f$ to test the performance by one-step ahead prediction. The trade-off
parameter $C$ and the parameter $\beta$ of the RBF kernel ($K(u, v) = \exp(-\beta\|u - v\|^2)$)
are obtained as a pair $(C, \beta)$ by a five-fold cross-validation of
SVR on the following grid: $[2^{-5}, 2^{-4}, \ldots, 2^{10}] \times [2^{-5}, 2^{-4}, \ldots, 2^{10}]$.
We obtain the corresponding parameters as $(2^{4}, 2^{-3})$ for DJIA, $(2^{-3}, 2^{1})$ for
NASDAQ, and $(2^{0}, 2^{2})$ for S&P500.
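The construction of the four-lag inputs and of the parameter grid can be sketched as follows. Both helper names are hypothetical, and the cross-validation loop itself is omitted because it depends on the SVR implementation being used.

```python
def lagged_samples(series, lags=4):
    """Pair each value with the `lags` values that precede it.

    Returns inputs x_t = (R_{t-lags}, ..., R_{t-1}) and targets R_t.
    """
    inputs = [tuple(series[t - lags:t]) for t in range(lags, len(series))]
    targets = [series[t] for t in range(lags, len(series))]
    return inputs, targets


def parameter_grid(low=-5, high=10):
    """All (C, beta) pairs with both values in {2^low, ..., 2^high}."""
    powers = [2.0 ** e for e in range(low, high + 1)]
    return [(c, beta) for c in powers for beta in powers]
```

The default grid contains 16 x 16 = 256 candidate pairs, each of which would be scored by five-fold cross-validation.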

As suggested in [15], there is a relationship among the values of five
consecutive days. We select k = 2, i.e. five days' values, to model the local volatility.
Since there are no support vectors in SVR when ε > 2.0, we only set the ε
values to 0.0, 0.2, ..., 1.0, and 2.0. The corresponding results are reported
in Table 6.3. As observed, the LSVR model outperforms the SVR algorithm in
most settings, even though the paired parameters (C, β) are
not tuned for our LSVR model. Furthermore, a paired t-test [13] performed
on the best results of both models in Table 6.3 shows that the LSVR model
outperforms SVR at the α = 10% significance level for a one-tailed test.


Table 6.3. Experimental results of the LSVR model and the SVR algorithm
on the financial data with different ε values

         DJIA                NASDAQ              S&P500
ε        LSVR      SVR       LSVR      SVR       LSVR      SVR
0.0      0.9204    1.3241    1.2897    1.3050    1.2372    1.2833
0.2      0.9835    1.1274    1.2896    1.3246    1.2399    1.2831
0.4      0.9341    0.9156    1.2898    1.3314    1.2442    1.2952
0.6      0.9096    0.9387    1.2901    1.3404    1.2540    1.2887
0.8      0.9273    0.9450    1.2904    1.3891    1.2788    1.2798
1.0      0.9434    0.9713    1.2908    1.4105    1.3044    1.2664
2.0      0.9666    1.0337    1.2928    1.3619    1.2643    1.3220


6.8 Summary 

In this chapter, we propose a Local Support Vector Regression (LSVR) model.
Different from the standard Support Vector Regression model, our novel model
offers a systematic and automatic scheme to adapt the margin locally and
flexibly. Therefore, it can tolerate noise adaptively. We demonstrate that
the model not only captures the local information of the data
in approximating functions, but also includes models similar to the
standard SVR as special cases. The experiments conducted on sine datasets and
on three stock market indices show that our model outperforms the standard SVR.
One direction for future work is to investigate efficient methods that directly
solve the original LSVR optimization instead of a relaxed form. In
addition, theoretical and empirical comparisons that quantify the difference
between the true solution and the relaxed solution are also valuable
research topics.
7


Extension III: Variational Margin Settings
within Local Data in Support Vector
Regression


In Chapter 6, we proposed a Local Support Vector Regression model to include
the local information of data. In this chapter, we consider another extension
of Support Vector Regression (SVR), which also includes the local information
of data, for a specific application, i.e. financial engineering. Both
models are motivated by the local viewpoint of data.

SVR is derived from the Support Vector Machine, which is based on
the principle of Structural Risk Minimization (SRM). Due to its solid
theoretical ground, SVR has been applied successfully in time series
prediction [9, 10]. Usually, when SVR is applied in time series forecasting, it uses
the ε-insensitive loss function to measure the empirical risk. This loss
function contains an ε-margin. It not only measures the training error (empirical
risk), but also controls the sparsity of the solution (the number of support
vectors). When the width of the ε-margin increases, the number of support
vectors tends to decrease. In the extreme, too wide a margin may result in a
constant regression function. When the width of the ε-margin decreases, the
number of support vectors tends to increase. In the limit, all the data points
become support vectors [19]. In this case, the regression function may fit the
noise in the data. Hence, setting the width of the ε-margin is very
important. It affects the complexity and the generalization of the regression
function indirectly.

Normally, the setting of ε is fixed, which is a kind of global setting.
However, in some applications, e.g. financial engineering, the global setting
will not be an optimal choice. Since financial data are usually volatile and
noisy, we extend the previous global margin setting to a variational one which
includes the local information of data.

In the following, we will first describe the SVR model briefly in Section 7.1.
We then indicate the problem of margin settings in Section 7.2. To solve the
problem of margin settings, we propose a general ε-insensitive loss function
for SVR in Section 7.3. We further aim at a specific application, i.e. financial
engineering, by introducing momentum and including the GARCH model for the
variational margin settings in Section 7.4. After the detailed experimental
setup and experimental results in Section 7.5, we conclude the chapter with
discussions in Section 7.6.


7.1 Support Vector Regression 


The aim of SVR is to find a function $f$ with parameters $w$ and $b$ by minimizing
the regression error as follows:

$$R_{reg}(f) = \frac{1}{2}\langle w, w \rangle + C \sum_{i=1}^{N} l(f(x_i), y_i) , \tag{7.1}$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product. The Euclidean norm $\langle w, w \rangle$ measures
the flatness of the function $f$. Minimizing $\langle w, w \rangle$ will make the regression
function as flat as possible [16].

The function $f$ is then defined as

$$f(x, w, b) = \langle w, \phi(x) \rangle + b , \tag{7.2}$$

where $\phi: \mathcal{X} \rightarrow \Omega$ maps $x \in \mathcal{X} \subseteq \mathbb{R}^d$ into a high (possibly infinite) dimensional
space $\Omega$, and $b \in \mathbb{R}$.

There are several loss functions which could be used to measure the
regression error, e.g. the squared loss function, Huber's loss function, the
ε-insensitive loss function, etc. In SVR, the ε-insensitive loss function is used
to measure the loss [19] (illustrated in Fig. 7.1):

$$l(y, f(x)) = \begin{cases} 0, & \text{if } |y - f(x)| < \epsilon ; \\ |y - f(x)| - \epsilon, & \text{otherwise} . \end{cases} \tag{7.3}$$

An advantage of this loss function is that the margin ε implicitly influences
which regression function is found.
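As a minimal sketch, the loss of Eq.(7.3) and the regularized risk of Eq.(7.1) can be written as follows. The function names are hypothetical, and the identity feature map $\phi(x) = x$ is assumed purely to keep the example self-contained.

```python
def eps_insensitive_loss(y, fx, eps):
    """Zero inside the eps-tube, linear outside it."""
    return max(0.0, abs(y - fx) - eps)


def regularized_risk(w, b, X, y, eps, C):
    """0.5 * <w, w> + C * sum of eps-insensitive losses, with phi(x) = x assumed."""
    flatness = 0.5 * sum(wj * wj for wj in w)
    losses = 0.0
    for x_i, y_i in zip(X, y):
        f_i = sum(wj * xj for wj, xj in zip(w, x_i)) + b
        losses += eps_insensitive_loss(y_i, f_i, eps)
    return flatness + C * losses
```

Points whose residual is smaller than ε contribute nothing to the risk, which is the source of the sparsity discussed at the start of this chapter.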



Fig. 7.1. Linear regression in the feature space with the ε-insensitive loss function
Minimizing Eq.(7.1) with the loss function of Eq.(7.3) is
equivalent to solving the following constrained minimization problem:

$$\min \ \tau(w, b, \xi^{(*)}) = \frac{1}{2}\langle w, w \rangle + C \sum_{i=1}^{N} (\xi_i + \xi_i^*) , \tag{7.4}$$

subject to

$$\begin{aligned}
y_i - (\langle w, \phi(x_i) \rangle + b) &\leq \epsilon + \xi_i , \\
(\langle w, \phi(x_i) \rangle + b) - y_i &\leq \epsilon + \xi_i^* , \\
\xi_i^{(*)} &\geq 0 .
\end{aligned} \tag{7.5}$$

Here and below, the index $i$ ranges from 1 to $N$, and $(*)$ is a shorthand
implying both the variables with and without asterisks. $\xi_i$ and $\xi_i^*$ measure
the up error and the down error for the sample point $(x_i, y_i)$ respectively, see
Fig. 7.1.

A standard method to find the optimal solution of the above minimization
problem in Eq.(7.4), and hence the function $f$ in Eq.(7.2), is to
construct the dual of this optimization problem (the primal problem)
by the Lagrange method and to solve the dual instead of the primal
minimization problem. The optimization then becomes a
Quadratic Programming (QP) problem as follows [19]:

$$\min \ Q(\alpha^{(*)}) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle \phi(x_i), \phi(x_j) \rangle
+ \sum_{i=1}^{N} (\epsilon - y_i)\alpha_i + \sum_{i=1}^{N} (\epsilon + y_i)\alpha_i^* , \tag{7.6}$$

subject to

$$\sum_{i=1}^{N} (\alpha_i - \alpha_i^*) = 0, \quad \alpha_i^{(*)} \in [0, C] . \tag{7.7}$$

After solving this QP problem, we obtain the regression function as:

$$f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \langle \phi(x_i), \phi(x) \rangle + b ,$$

where $\alpha_i$, $\alpha_i^*$ are the Lagrange multipliers used to pull and push $f$ towards
the observation $y$. Those sample points $(x_i, y_i)$ with nonzero $\alpha_i$ or $\alpha_i^*$ are
called support vectors.
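Once the multipliers are known, evaluating the regression function is a weighted sum of kernel values. The sketch below assumes the multipliers and the offset have already been obtained from some QP solver; the function names are hypothetical, and `kernel` stands for any callable that returns the inner product of the mapped points.

```python
def svr_predict(x, X_train, alpha, alpha_star, b, kernel):
    """f(x) = sum_i (alpha_i - alpha_i^*) * K(x_i, x) + b."""
    total = 0.0
    for x_i, a, a_s in zip(X_train, alpha, alpha_star):
        total += (a - a_s) * kernel(x_i, x)
    return total + b


def support_vector_indices(alpha, alpha_star, tol=1e-12):
    """Indices of training points with a nonzero multiplier."""
    return [i for i, (a, a_s) in enumerate(zip(alpha, alpha_star))
            if a > tol or a_s > tol]
```

Only the support vectors contribute to the sum, so in practice the prediction can skip all points whose multipliers are both zero.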

By the kernel trick, one can define the kernel function
as the inner product of the mapped points, i.e. $K(x, z) = \langle \phi(x), \phi(z) \rangle$.
Therefore, one only needs to specify a kernel function without considering the
mapping function or the feature space explicitly. The kernel
function must satisfy Mercer's theorem [6, 14].

Four kernel functions are commonly used:

Linear function: $K(x_k, x_l) = \langle x_k, x_l \rangle$ ;

Polynomial function with parameter $d$: $K(x_k, x_l) = (\langle x_k, x_l \rangle + 1)^d$ ;

Radial Basis Function (RBF) with parameter $\beta$:

$$K(x_k, x_l) = \exp(-\beta \|x_k - x_l\|^2) , \tag{7.8}$$

Hyperbolic tangent: $K(x_k, x_l) = \tanh(2\langle x_k, x_l \rangle + 1)$ .
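The four kernels listed above translate directly into code. The following sketch uses plain Python lists as vectors; the function names are chosen here for illustration.

```python
import math


def inner(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return inner(u, v)


def polynomial_kernel(u, v, d):
    return (inner(u, v) + 1.0) ** d


def rbf_kernel(u, v, beta):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-beta * sq_dist)


def tanh_kernel(u, v):
    return math.tanh(2.0 * inner(u, v) + 1.0)
```

The RBF kernel equals 1 when both arguments coincide and decays with the squared distance at a rate controlled by `beta`.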


7.2 Problem in Margin Settings 

Since the width of the ε-margin affects the complexity and the
generalization of the regression function indirectly, it is very important to
seek an optimal value for each application. In general, ε is difficult
to control [10], as one does not know beforehand which value fits the
curve better.

There are several methods to deal with this. First, most practitioners
set ε to a non-negative constant just for convenience.
For example, in [18], the margin width is simply set to 0, which amounts to
the least modulus loss function. In other instances, the margin width has been
set to a very small value [5, 9, 20]. The second method is cross-validation,
e.g. [4, 10], which is usually too expensive in terms of computation. A
more efficient approach is to use a variant called ν-SVR [12, 13, 14, 15],
which determines ε through another parameter ν; it is stated that ν may
be easier to specify than ε. Another approach by Smola et al. [17] is to find
the "optimal" choice of ε by maximizing the statistical efficiency of a
location parameter estimator. They showed that the asymptotically optimal
ε should scale linearly with the input noise of the training data, and this
was verified experimentally. Recently, a regularization path was proposed for
SVR to seek optimal parameters in [7, 21].

In financial time series, however, the data are noisy and highly volatile. The
fixed margin setting is not suitable for this application. We therefore
extend the fixed ε-margin setting to variational ones.


7.3 General ε-insensitive Loss Function

First, we note that the margin in the ε-insensitive loss function has two
characteristics: it is fixed and symmetrical. Based on these two characteristics, we
have proposed a general ε-insensitive loss function and classified the margin
into four cases in [22]: Fixed and Symmetrical Margin (FASM), Fixed and
Asymmetrical Margin (FAAM), Non-fixed and Symmetrical Margin (NASM)
and Non-fixed and Asymmetrical Margin (NAAM). Table 7.1 gives a simple
description of these four categories. FASM is equivalent to the margin in the
ε-insensitive loss function, see Fig. 7.2(a). FAAM is divided into an up margin
and a down margin; each margin is fixed but the two are not equal (Fig. 7.2(b)).
NASM has equal up and down margins, but they vary
with the data (Fig. 7.2(c)). NAAM combines both characteristics: the margins
vary with the data and are not equal (Fig. 7.2(d)).


Table 7.1. Margin categories

             Symmetrical    Asymmetrical
Fixed        FASM           FAAM
Non-fixed    NASM           NAAM






Fig. 7.2. Four categories in the general ε-insensitive loss function of SVR


In the following, we will derive the SV formula based on the general
ε-insensitive loss function. The general ε-insensitive loss function splits the
margin in the original ε-insensitive loss function into two parts, the up margin
and the down margin:

$$l(f(x_i) - y_i) = \begin{cases}
0, & \text{if } -d(x_i) \leq y_i - f(x_i) \leq u(x_i); \\
y_i - f(x_i) - u(x_i), & \text{if } y_i - f(x_i) > u(x_i); \\
f(x_i) - y_i - d(x_i), & \text{if } f(x_i) - y_i > d(x_i),
\end{cases} \tag{7.9}$$






















where $d(x_i), u(x_i) \geq 0$ are two functions determining the down margin and
the up margin at point $x_i$ respectively. When $d(x)$ and $u(x)$ are both constant
functions and $d(x) = u(x)$, Eq.(7.9) amounts to the ε-insensitive loss function
in Eq.(7.3) and we label it as FASM (Fixed and Symmetrical Margin). When
$d(x)$ and $u(x)$ are both constant functions but $d(x) \neq u(x)$, this case is
labeled as FAAM (Fixed and Asymmetrical Margin). In the case of NASM
(Non-fixed and Symmetrical Margin), $d(x) = u(x)$ but they vary with the
data. The last case is the non-fixed and asymmetrical margin (NAAM),
where $d(x)$ and $u(x)$ vary with the data and $d(x) \neq u(x)$.
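The general loss of Eq.(7.9) can be sketched as a single function taking the up margin and the down margin at the point in question. The function name and argument order are illustrative choices, not part of the original formulation.

```python
def general_eps_loss(y, fx, up, down):
    """General eps-insensitive loss with separate up and down margins.

    up   : margin allowed when the observation lies above the prediction
    down : margin allowed when the observation lies below the prediction
    """
    residual = y - fx
    if residual > up:
        return residual - up
    if -residual > down:
        return -residual - down
    return 0.0
```

Setting `up == down` to the same constant for every point recovers the FASM case, i.e. the ordinary ε-insensitive loss; passing point-dependent values gives the NASM and NAAM cases.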

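In the same way, we use the standard method to find the solution of
Eq.(7.1) with the cost function of Eq.(7.9), as in [19], and obtain:

$$\min_{w,\, b,\, \xi^{(*)}} \left\{ \frac{1}{2}\langle w, w \rangle + C \sum_{i=1}^{N} (\xi_i + \xi_i^*) \right\} , \tag{7.10}$$

subject to

$$\begin{aligned}
y_i - \langle w, \phi(x_i) \rangle - b &\leq u(x_i) + \xi_i , \\
\langle w, \phi(x_i) \rangle + b - y_i &\leq d(x_i) + \xi_i^* , \\
\xi_i^{(*)} &\geq 0 .
\end{aligned}$$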


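Using the standard primal-dual method as above, we also obtain a QP problem
as follows:

$$\min \ Q(\alpha^{(*)}) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle \phi(x_i), \phi(x_j) \rangle
+ \sum_{i=1}^{N} (u(x_i) - y_i)\alpha_i + \sum_{i=1}^{N} (d(x_i) + y_i)\alpha_i^* , \tag{7.11}$$

subject to

$$\sum_{i=1}^{N} (\alpha_i - \alpha_i^*) = 0, \quad \alpha_i, \alpha_i^* \in [0, C] .$$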

This QP problem is very similar to the original QP problem in Eq.(7.6);
therefore, only a small modification of the SMO algorithm is needed to solve
it. In practice, we add a new data structure to store both
margins: the up margin, $u(x)$, and the down margin, $d(x)$. This does not affect the
time complexity of the SVR algorithm; it only requires additional space, linear in
the number of data points, to store the corresponding margins. We modify
LIBSVM [5] to implement the SVR algorithm.

After solving this QP problem, we then obtain the regression function:

$$f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \langle \phi(x_i), \phi(x) \rangle + b , \tag{7.12}$$

where $\alpha_i$, $\alpha_i^*$ are the corresponding Lagrange multipliers, also used to pull and
push $f$ towards the observation $y$.

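The offset $b$ is computed from the Karush-Kuhn-Tucker (KKT)
conditions. Here, they are:

$$\alpha_i \left( u(x_i) + \xi_i - y_i + \langle w, \phi(x_i) \rangle + b \right) = 0 ,$$

$$\alpha_i^* \left( d(x_i) + \xi_i^* + y_i - \langle w, \phi(x_i) \rangle - b \right) = 0 ,$$

and

$$(C - \alpha_i)\xi_i = 0 ,$$

$$(C - \alpha_i^*)\xi_i^* = 0 .$$

Therefore, $b$ can be computed as follows:

$$b = \begin{cases}
y_i - \langle w, \phi(x_i) \rangle - u(x_i), & \text{for } \alpha_i \in (0, C) ; \\
y_i - \langle w, \phi(x_i) \rangle + d(x_i), & \text{for } \alpha_i^* \in (0, C) .
\end{cases}$$

When no $\alpha_i^{(*)}$ lies in $(0, C)$, alternative methods, e.g. those in [5], are used.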
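The rule for $b$ can be sketched as follows. The values $\langle w, \phi(x_i) \rangle$ are assumed to be precomputed and passed in as `wphi`; averaging over all multipliers strictly inside $(0, C)$ is a common numerical safeguard and is an assumption here, not something stated in the text. The function name is hypothetical.

```python
def offset_from_kkt(alpha, alpha_star, y, wphi, up, down, C, tol=1e-9):
    """Estimate b from multipliers strictly between 0 and C.

    wphi[i] is the precomputed value <w, phi(x_i)>.
    Returns None when no multiplier lies strictly inside (0, C).
    """
    estimates = []
    for i in range(len(y)):
        if tol < alpha[i] < C - tol:
            estimates.append(y[i] - wphi[i] - up[i])
        if tol < alpha_star[i] < C - tol:
            estimates.append(y[i] - wphi[i] + down[i])
    if not estimates:
        return None  # fall back to another method in this case
    return sum(estimates) / len(estimates)
```

In exact arithmetic every qualifying multiplier gives the same value of $b$, so the average only smooths out solver tolerance.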


7.4 Non-fixed Margin Cases 

7.4.1 Momentum 

In [23], we have focused on the case of NAAM. More specifically, we have added
a momentum term to the margin setting. The margin is a linear combination
of the standard deviation and the momentum. The up margin and down
margin are set in the following forms:

$$\begin{aligned}
u(x_i) &= \lambda_1 \cdot \sigma(x_i) + \mu \cdot \Delta(x_i), \quad i = 1, \ldots, N, \\
d(x_i) &= \lambda_2 \cdot \sigma(x_i) - \mu \cdot \Delta(x_i), \quad i = 1, \ldots, N,
\end{aligned} \tag{7.13}$$

where $\sigma(x_i)$ is the standard deviation of input $x_i$, $\Delta(x_i)$ is the momentum at
point $x_i$, $\lambda_1$, $\lambda_2$ are both positive constants and $\mu$ is a non-negative constant.

Therefore, the width of the margin at point $x_i$ is:

$$W(x_i) = u(x_i) + d(x_i) = (\lambda_1 + \lambda_2) \cdot \sigma(x_i) .$$

It is determined by $\sigma(x_i)$ and the sum of $\lambda_1$ and $\lambda_2$. Here we call $\lambda_1$, $\lambda_2$
the coefficients of the margin width and $\mu$ the coefficient of
momentum. The margin setting of Eq.(7.13) includes the
case of NASM (when $\mu = 0$).
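A minimal sketch of the margin setting in Eq.(7.13) is given below. The function name and default values are illustrative; the source requires both margins to be non-negative but does not say how a negative value should be handled, so no clipping is applied here.

```python
def momentum_margins(sigma, delta, lam1=0.5, lam2=0.5, mu=1.0):
    """Return (up margin, down margin) for one data point.

    sigma : standard deviation of the input window
    delta : momentum at the point
    """
    up = lam1 * sigma + mu * delta
    down = lam2 * sigma - mu * delta
    return up, down
```

The momentum term shifts the tube without changing its total width, which always equals `(lam1 + lam2) * sigma`; with `mu = 0` and `lam1 == lam2` the two margins coincide (the NASM case).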

From [22], when $\mu \neq 0$ and $\Delta(x) > 0$, the up margin is larger than the
down margin and we can under-predict the stock price. When $\mu \neq 0$ and
$\Delta(x) < 0$, the up margin is smaller than the down margin and we can
over-predict the stock price. A simple illustration is shown in Fig. 7.3. Based on
these observations, in our prediction we assume risk aversion, or more
precisely downside risk aversion. When the stock price shows an uptrend, we know
that it will not rise forever, so we tend to under-predict the stock prices
in this case. Conversely, when the stock price goes down, we tend to
over-predict it. We add this information to the margin setting by controlling
the momentum term.



Fig. 7.3. Margin settings: dashed lines are the bounds of margins; dashed-dotted
lines are actual data series; solid-bold lines are the new objective
function, $f_{new}$, obtained by the new margin settings. The upper shaded area is the case
where the new objective function under-predicts the actual function; the lower
shaded areas are the case of over-prediction


There are many ways to calculate the momentum. For example,
the simplest way is to set it as a constant. In this chapter, we concentrate
on using the Exponential Moving Average (EMA). The reason for using the EMA
is that it is time-varying and can reflect the uptrend and downtrend of
the financial data. A minor drawback is that the EMA lags the data. An
$n$-day EMA sequence begins from the first day, i.e. $EMA_1 = y_1$, and the
following values are calculated by:

$$EMA_i = EMA_{i-1} \times (1 - r) + y_i \times r ,$$

where $r = 2/(1 + n)$, and $y_i$ is the information about day $i$, e.g. the closing
price on day $i$, the volume on day $i$, etc. Here, the current day's momentum
is set as the difference between the current day's EMA and the EMA
$k$ days earlier, i.e.

$$\Delta(x_i) = EMA_i - EMA_{i-k} .$$
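The EMA recursion and the resulting momentum can be sketched as follows. The function names are hypothetical, and returning `None` for the first `k` days (where no EMA value `k` days earlier exists) is an assumption made for this illustration.

```python
def ema(values, n):
    """n-day exponential moving average, starting from the first value."""
    r = 2.0 / (1 + n)
    out = [values[0]]
    for y in values[1:]:
        out.append(out[-1] * (1 - r) + y * r)
    return out


def momentum(values, n, k):
    """Momentum as the EMA difference over a lag of k days.

    The first k entries are None because the lagged EMA is undefined there.
    """
    e = ema(values, n)
    return [None] * k + [e[i] - e[i - k] for i in range(k, len(e))]
```

For example, with `n = 3` the smoothing factor is 0.5, so the EMA of `[1.0, 2.0, 3.0]` is `[1.0, 1.5, 2.25]` and the one-day momentum is `[None, 0.5, 0.75]`.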


7.4.2 GARCH 

In the above methods, the datasets we used in the experiments are share
prices [22, 23]. We use the standard deviation of the input $x_t$, which can reflect
the volatility of the financial time series over time, to determine the width of
the margin at time $t$ in our prediction. The Generalized AutoRegressive
Conditionally Heteroscedastic (GARCH) model [3] is, however, a more commonly used
model to reflect the volatility of financial time series.

The standard GARCH($p$, $q$) model with Gaussian shocks takes the following
form:

$$y_t = c_0 + x_t^T b + \epsilon_t , \qquad \epsilon_t \mid \Psi_{t-1} \sim N(0, \sigma_t^2) ,$$

where

$$\sigma_t^2 = \kappa_0 + \sum_{i=1}^{p} \gamma_i \sigma_{t-i}^2 + \sum_{i=1}^{q} \theta_i \epsilon_{t-i}^2 .$$

The GARCH model is applied to the return series. We therefore use the
continuously compounded return as the data series and use the $\sigma_t$ calculated by
GARCH(1,1) as the width of the margin at time $t$.
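For illustration, the conditional variance recursion of a GARCH(1,1) model can be sketched as below. This is not a fitting procedure: the constant `kappa0`, the coefficient `gamma1` on the previous variance, the coefficient `theta1` on the previous squared shock, and the starting variance are all assumed to be given (in practice they are estimated, e.g. by maximum likelihood), and the parameter names are chosen here for illustration only.

```python
import math


def garch11_variance(residuals, kappa0, gamma1, theta1, initial_variance):
    """Conditional variances of a GARCH(1,1) model with known parameters.

    var[t] = kappa0 + gamma1 * var[t-1] + theta1 * residuals[t-1] ** 2
    """
    var = [initial_variance]
    for t in range(1, len(residuals)):
        var.append(kappa0 + gamma1 * var[t - 1] + theta1 * residuals[t - 1] ** 2)
    return var


def garch11_margin_width(residuals, kappa0, gamma1, theta1, initial_variance):
    """Margin width at each time step, taken as the conditional std sigma_t."""
    variances = garch11_variance(residuals, kappa0, gamma1, theta1, initial_variance)
    return [math.sqrt(v) for v in variances]
```

A large shock at time $t-1$ raises the variance at time $t$, so the margin widens right after volatile days and narrows again during calm periods.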


7.5 Experiments 

In this section, we will perform the experiments by using the momentum and 
GARCH models to set the margins. Before illustrating the experiments, we 
define the accuracy and risk measurement first. 

7.5.1 Accuracy Metrics and Risk Measurement 

In order to measure the prediction performance of our model, we define the 
Mean Absolute Error (MAE). 

Let $a_t$ and $p_t$ be the actual and predicted values at
day $t$, respectively, and let $m$ be the number of testing data.

Definition 7.1. Mean Absolute Error (MAE) measures the discrepancy
between the actual and predicted values; the smaller the value of MAE, the
closer the predicted values are to the actual values. MAE is calculated by:

$$MAE = \frac{1}{m} \sum_{t=1}^{m} |a_t - p_t| . \tag{7.14}$$

We also consider the risk of using this model in the prediction.
Risk is a term frequently encountered in the strategic management and financial
literature. However, risk has a variety of different meanings, and the
meaning intended in a particular project is rarely clarified [2]. In the financial
literature, Markowitz first formulated portfolio selection as a mathematical
model [8]. In his model, the "return" of a portfolio is measured by the
expected value of the random portfolio return and the associated "risk" is
quantified by the variance of the portfolio return. However, the use of variance
to measure risk makes no distinction between gains and losses. Markowitz
also proposed to use semi-variance to measure the risk of loss, that is, the
sum of the squares of negative deviations from the mean divided by the total
number of observations:

$$\frac{1}{m} \sum_{t=1}^{m} [\min(r_t - \mu, 0)]^2 .$$

The great advantage of semi-variance over variance is that
it does not include positive gains, so what is considered as risk takes into
account only negative deviations. However, minimizing downside risk does not
always differ from minimizing variance. For example, if the distribution
is symmetric, like the normal curve, minimizing variance and semi-variance
will lead to the same problem. The only case that justifies the use of
semi-variance is when skewness is observed [1]. A generalization of
semi-variance is given in [1]:

$$\text{downside risk} \Rightarrow \frac{1}{m} \sum_{t=1}^{m} [\min(r_t - \mu, 0)]^k , \tag{7.15}$$

where $k$ is any power that one chooses; when $k = 1$, the absolute value of
the term in the brackets should be taken, and $\mu$ is a chosen benchmark
(not necessarily the mean).
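The generalized downside risk of Eq.(7.15) can be sketched as follows. The function name is hypothetical, and taking the absolute value of the bracketed term for every power `k` (not only `k = 1`) is an assumption made so that the result is non-negative for odd powers as well.

```python
def downside_risk(returns, benchmark, k=2):
    """Average of |min(r_t - benchmark, 0)| ** k over all observations.

    With k = 2 and the benchmark equal to the mean this is the semi-variance.
    """
    m = len(returns)
    return sum(abs(min(r - benchmark, 0.0)) ** k for r in returns) / m
```

Observations above the benchmark contribute zero, but they still count in the denominator, as in the definition of semi-variance.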

Based on Eq.(7.15), we choose $k = 1$ and define the following risk
measurements.

Definition 7.2. Upside Mean Absolute Error (UMAE) measures the upside
risk; the smaller the value of UMAE, the smaller the upside risk. UMAE
is defined as:

$$UMAE = \frac{1}{m} \sum_{\substack{t=1 \\ a_t > p_t}}^{m} (a_t - p_t) . \tag{7.16}$$

Definition 7.3. Downside Mean Absolute Error (DMAE) measures
the downside risk; the smaller the value of DMAE, the smaller the downside
risk. DMAE is defined as:

$$DMAE = \frac{1}{m} \sum_{\substack{t=1 \\ a_t < p_t}}^{m} (p_t - a_t) . \tag{7.17}$$
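The three measures can be computed directly from the actual and predicted series. The sketch below uses illustrative function names; all three sums are divided by the total number of test points $m$, as in the definitions.

```python
def mae(actual, predicted):
    """Mean absolute error over all test days."""
    m = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / m


def umae(actual, predicted):
    """Upside MAE: only days where the actual value exceeds the prediction."""
    m = len(actual)
    return sum(a - p for a, p in zip(actual, predicted) if a > p) / m


def dmae(actual, predicted):
    """Downside MAE: only days where the prediction exceeds the actual value."""
    m = len(actual)
    return sum(p - a for a, p in zip(actual, predicted) if a < p) / m
```

Since every day falls into exactly one of the cases $a_t > p_t$, $a_t < p_t$ or $a_t = p_t$, the identity MAE = UMAE + DMAE holds, which is a useful check when reading the result tables.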


7.5.2 Momentum 

We compare the modified SVR algorithm by adapting margins using momen¬ 
tum with the AutoRegression (AR) model and the Radial Basis Function 
(RBF) method. The results are presented as follows one by one for three 
algorithms. 




7.5.2.1 SVR Algorithm 

Two datasets are used in this experiment: 

HSI: daily closing prices of Hong Kong’s Hang Seng Index (HSI) from 
January 2nd, 1998 to December 29, 2000. 

DJIA: daily closing prices of Dow Jones Industrial Average (DJIA) from 
January 2nd, 1998 to December 29, 2000. 

The ratio of the number of training data to the number of testing data
is set to 5:1. The corresponding initial training time periods are
listed in Table 7.2.


Table 7.2. Indices, time periods and parameters for momentum experiments

Indices    Initial training time periods    C        β
HSI        02/01/1998 - 04/07/2000          16000    $2^{-27}$
DJIA       02/01/1998 - 29/06/2000          8000     $2^{-22}$


Furthermore, we model the system as pt = f(x t ), where / is learned by 
the SVR algorithm from the training data, x t = (at- 4 , at- 3 , <Zt_ 2 , at-i), it 
is the daily closing index in day t. 

Before generating the model, we perform cross-validation on the initial
training data to determine the parameters needed in SVR: $C$, the cost of
error, and $\beta$, the parameter of the kernel function. The corresponding
parameters are also listed in Table 7.2. With these parameters we build the
model by SVR from the initial training data. After obtaining the predicted
value, we shift the input window to the next time-step and train the model
again to obtain the next day's price. This one-step-ahead prediction is
repeated, shifting the window each time, for the remaining data.
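The retrain-and-shift procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the window is taken to have a fixed length that slides forward one day at a time (the text does not say whether it slides or grows), and `fit` is a hypothetical callable standing in for SVR training. The placeholder learner `fit_last_value` is not the SVR model; it only makes the sketch runnable.

```python
def rolling_one_step_forecasts(series, n_train, fit, order=4):
    """Refit on a sliding window and predict one day ahead each time.

    series  : list of daily closing values
    n_train : number of values in the training window (assumed fixed length)
    fit     : callable(inputs, targets) -> predict(x); stands in for SVR training
    order   : number of lagged values in x_t (4 in the text)
    """
    forecasts = []
    for end in range(n_train, len(series)):
        window = series[end - n_train:end]            # most recent n_train values
        inputs = [window[i - order:i] for i in range(order, len(window))]
        targets = [window[i] for i in range(order, len(window))]
        predict = fit(inputs, targets)                # retrain on the shifted window
        forecasts.append(predict(series[end - order:end]))  # forecast series[end]
    return forecasts


def fit_last_value(inputs, targets):
    """Hypothetical placeholder learner: predicts the most recent input value."""
    return lambda x: x[-1]


# Made-up prices, for illustration only.
prices = [10.0, 11.0, 12.0, 11.5, 12.5, 13.0, 12.0, 12.8, 13.5, 14.0]
forecasts = rolling_one_step_forecasts(prices, n_train=6, fit=fit_last_value)
print(forecasts)
```

Any learner with the same `fit(inputs, targets)` shape, such as a wrapper around an SVR library, could be passed in place of the placeholder.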

Non-fixed Cases: The margins are set according to Eq.(7.13). In the
case of NASM, we set $\lambda_1 = \lambda_2 = 1/2$ and $\mu = 0$; thus the overall
margin width at day $t$ is equal to the standard deviation of the input $x_t$,
$\sigma(x_t)$.

In the case of NAAM, we also fix $\lambda_1 = \lambda_2 = 1/2$, so that the
comparison with the NASM case is fair. In addition, we have to determine three
parameters: $n$, the length of EMA; $k$, the lag of EMA; and $\mu$, the
coefficient of momentum. We have performed the following experiments to test
their effects:

(a) At first, we set $k = 1$, $\mu = 1$ and use 10, 30, 50, 100 as the length
of EMA respectively. From the results in Table 7.3 we can see that the DMAE
values in all cases of NAAM are smaller than that in the NASM case; thus we
have a smaller downside risk in the NAAM case, which matches our assumption.
We also see that, for dataset HSI, the MAE gradually decreases as the length
of EMA increases, and that when the length equals 100, the MAE and the DMAE
are the smallest in all cases of NAAM. For dataset DJIA, the MAE and the DMAE
are the smallest in all cases of NAAM when the length equals 30.


Table 7.3. Effect of the length of EMA on HSI and DJIA with parameters
$(k, \mu) = (1, 1)$

                 HSI                          DJIA
Type   n     MAE      UMAE     DMAE       MAE     UMAE    DMAE
NASM   -     216.78   104.58   112.20     85.33   40.29   45.04
NAAM   10    222.43   115.64   106.79     85.68   43.13   42.55
NAAM   30    218.18   114.04   104.14     84.12   41.82   42.30
NAAM   50    217.93   113.38   104.55     84.57   42.12   42.45
NAAM   100   216.50   113.04   103.46     84.80   42.41   42.39

In the following, we use the best length of EMA from the above
experiments for the corresponding datasets, i.e. $n = 100$ for dataset HSI
and $n = 30$ for dataset DJIA.

(b) When testing the effect of the lag $k$, we let $\mu = 1$ and set $k$ to
1, 2, 4, 8 respectively for both datasets. The results are listed in Table 7.4.
They show that the MAE increases as the lag of EMA increases. This indicates
that, in terms of MAE, the results when the lag of EMA equals 1 are superior
to the other cases.


Table 7.4. Effect of the lag (distance) of EMA on HSI and DJIA

      HSI with $(n, \mu) = (100, 1)$      DJIA with $(n, \mu) = (30, 1)$
k     MAE      UMAE     DMAE            MAE      UMAE    DMAE
1     216.50   113.04   103.46          84.12    41.82   42.30
2     219.02   125.30   93.72           85.42    43.91   41.51
4     228.25   149.36   78.88           90.99    49.16   41.83
8     260.73   200.74   59.99           103.77   58.03   45.74


(c) Here, we set $k = 1$ and $\mu = 1, 1/2, 1/4, 1/8$ respectively for both
datasets to see the effect of $\mu$. From Table 7.5, we see that the DMAE
increases gradually as the coefficient of momentum decreases and that the
MAE is smaller than the value in the NASM case. The MAE for dataset HSI
(columns 2-4 of Table 7.5) fluctuates, whereas the MAE for dataset DJIA
(columns 5-7 of Table 7.5) increases gradually as the coefficient of momentum
decreases.


Table 7.5. Effect of the coefficient of momentum on HSI and DJIA

        HSI with $(n, k) = (100, 1)$      DJIA with $(n, k) = (30, 1)$
$\mu$   MAE      UMAE     DMAE          MAE     UMAE    DMAE
1       216.50   113.04   103.46        84.12   41.82   42.30
1/2     216.55   108.97   107.58        84.88   41.32   43.56
1/4     216.19   106.36   109.83        85.02   41.14   43.88
1/8     216.41   105.32   111.08        85.22   40.86   44.36


We also plot the daily closing prices of HSI with the 100-day EMA and
the prices of DJIA with the 30-day EMA in Fig. 7.4 and Fig. 7.5 respectively.
Table 7.6 lists the Average Standard Deviation (ASD) of the input x for the
training datasets HSI and DJIA, together with the Average of Absolute
Momentums (AAM) of the input x for the best EMA length of each training
dataset. We can observe that the ASD of HSI is higher than that of DJIA and
that the ratio of AAM to ASD is smaller for HSI than for DJIA.


Table 7.6. ASD and AAM

Dataset   ASD      n (AAM)   AAM     Ratio (AAM/ASD)
HSI       182.28   100       20.80   0.114
DJIA      79.95    30        15.64   0.196


We now summarize the above experiments. First, the experimental results
show the effects of $n$, $k$ and $\mu$. Following these results, a suitable
setting for both $k$ and $\mu$ is 1, which can be applied when a new dataset
arrives. The only parameter that remains to be determined is the length of
EMA, $n$, which may be chosen with reference to the ASD of the training
dataset: when the ASD is larger, we may use a longer EMA; conversely, when
the ASD is smaller, we may use a shorter EMA.
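To make the roles of the EMA length n and the lag k concrete, the following Python sketch computes an EMA, a momentum series and its average absolute value. Everything here is an assumption for illustration: the smoothing factor 2/(n + 1) is a common EMA convention, and the momentum is taken to be the change of the EMA over k days. The exact definitions behind Eq.(7.13) may differ.

```python
def ema(values, n):
    """Exponential moving average; smoothing factor 2 / (n + 1) is an assumed convention."""
    alpha = 2.0 / (n + 1)
    out = [values[0]]                      # seed with the first observation
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out


def momentum(values, n, k):
    """Assumed momentum: change of the n-day EMA over a lag of k days."""
    e = ema(values, n)
    return [e[t] - e[t - k] for t in range(k, len(e))]


def average_absolute_momentum(values, n, k):
    """Mean of the absolute momentum values (an AAM-like summary)."""
    m = momentum(values, n, k)
    return sum(abs(x) for x in m) / len(m)


# Made-up series, for illustration only.
series = [1.0, 2.0, 3.0, 4.0, 5.0]
print(ema(series, 3))
print(momentum(series, 3, 1))
print(average_absolute_momentum(series, 3, 1))
```

Under these assumptions, a longer EMA reacts more slowly to recent prices, which smooths the momentum series.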

Fixed Cases: After considering the non-fixed margin cases, we also test
the predictive results of fixed margins. For dataset HSI, we set the width
of the margin to 200 (approximately the ASD of HSI), i.e.
$u(x) + d(x) = 200$. The up-margin $u(x)$ ranges from 0 to 200 in increments
of one-tenth of 200, i.e. 20. The results are listed in columns 1-5 of
Table 7.7. Similarly, for dataset DJIA, we set the width of the margin to
90 (approximately the ASD of DJIA), i.e. $u(x) + d(x) = 90$. The up-margin
$u(x)$ ranges from 0 to 90 in increments of one-tenth of 90, i.e. 9. The
results are listed in columns 6-10 of Table 7.7. We can see that for both
datasets, as the up-margin increases, the DMAE tends to decrease.


Table 7.7. Results of FASM and FAAM for HSI and DJIA

HSI [u(x) + d(x) = 200]                     DJIA [u(x) + d(x) = 90]
u(x)   d(x)   MAE      UMAE     DMAE        u(x)   d(x)   MAE     UMAE    DMAE
0      200    236.04   62.24    173.80      0      90     91.63   20.45   71.18
20     180    230.85   69.65    161.20      9      81     89.14   23.70   65.44
40     160    226.29   77.37    148.92      18     72     87.35   27.31   60.04
60     140    222.24   85.34    136.90      27     63     86.09   31.18   54.91
80     120    219.35   93.90    125.45      36     54     85.30   35.28   50.02
100    100    217.83   103.14   114.69      45     45     85.45   39.86   45.59
120    80     217.35   112.90   104.45      54     36     86.33   44.80   41.53
140    60     217.88   123.16   94.72       63     27     87.40   49.83   37.57
160    40     219.49   133.97   85.52       72     18     88.64   54.95   33.69
180    20     221.66   145.05   76.61       81     9      90.80   60.53   30.27
200    0      224.83   156.64   68.19       90     0      93.75   66.51   27.24
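The grid of fixed margins used in Table 7.7 can be enumerated with a few lines of Python. The helper name `fixed_margin_grid` is hypothetical; the MAE values are copied from the HSI columns of Table 7.7 and serve only to show how the split with the lowest MAE could be picked out.

```python
def fixed_margin_grid(width, steps=10):
    """Enumerate (u, d) pairs with u + d = width, u rising from 0 to width."""
    return [(width * i / steps, width - width * i / steps) for i in range(steps + 1)]


hsi_grid = fixed_margin_grid(200)
# MAE values for HSI copied from Table 7.7, in the same order as the grid.
hsi_mae = [236.04, 230.85, 226.29, 222.24, 219.35, 217.83,
           217.35, 217.88, 219.49, 221.66, 224.83]

best_mae, best_split = min(zip(hsi_mae, hsi_grid))
print(best_mae, best_split)
```

For HSI the lowest MAE among the fixed splits is obtained at u(x) = 120, d(x) = 80, which is still above the 216.78 reported for NASM in Table 7.3.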


Comparing the results in Table 7.3 with the results in Table 7.7 (the
experimental results are plotted in Fig. 7.6(b) and Fig. 7.7(b) respectively),
we can see that NASM and NAAM are both superior to FASM and FAAM
on both datasets.

In the following, we apply other models, namely AR models and the RBF
network, to the above two datasets. The best results of all the models
are illustrated in Fig. 7.6(a) for HSI and Fig. 7.7(a) for DJIA.

7.5.2.2 AR Models 

For AR models, we use the AR model of order 4 to predict the prices of
HSI and DJIA, so that we can compare the AR model with NASM and NAAM in
SVR with the same order. The results are listed in Table 7.8. From these
results, we can see that NASM and NAAM are superior to the AR model of
the same order.

Fig. 7.6. Experimental results of HSI

(a) NAAM, NASM vs. AR(4), RBF(7)
(b) NAAM, NASM vs. FAAM, FASM

Fig. 7.7. Experimental results of DJIA


Table 7.8. Results on AR(4)

Dataset   MAE      UMAE     DMAE
HSI       217.75   105.96   111.79
DJIA      88.74    46.36    42.38
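As an illustration of the AR baseline, the sketch below fits an AR(p) model with an intercept by ordinary least squares and produces a one-step-ahead prediction. The estimation method actually used for Table 7.8 is not stated in the text, so least squares is an assumption, and the function names are hypothetical. The linear solver is a plain Gaussian elimination with no handling of singular systems.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - rest) / aug[i][i]
    return x


def fit_ar(series, order):
    """Least-squares AR fit; returns [intercept, phi_1, ..., phi_order].

    phi_j multiplies the value j steps back.
    """
    rows = [[1.0] + [series[t - j] for j in range(1, order + 1)]
            for t in range(order, len(series))]
    targets = [series[t] for t in range(order, len(series))]
    k = order + 1
    # Normal equations: (X^T X) w = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_next(series, coeffs):
    """One-step-ahead prediction from the most recent values."""
    order = len(coeffs) - 1
    return coeffs[0] + sum(coeffs[j] * series[-j] for j in range(1, order + 1))


# Synthetic series obeying x_t = x_{t-1} - x_{t-2}, for illustration only.
series = [1.0, 2.0]
while len(series) < 14:
    series.append(series[-1] - series[-2])

coeffs = fit_ar(series, order=2)
print(coeffs, predict_next(series, coeffs))
```

Calling `fit_ar(prices, order=4)` gives the order used in the text; the synthetic series only serves to check that known coefficients are recovered.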


7.5.2.3 RBF Network 

For the RBF network, we use the implementation in NETLAB [11] and perform
one-step-ahead prediction of the prices of HSI and DJIA. Concretely, we leave
the other parameters at their defaults and set the number of hidden units to
3, 5, 7, 9 to learn $f$ by training the RBF network on the training samples.
The results for both datasets are listed in Table 7.9. Comparing the results
in Table 7.3 with the results in Table 7.9, we can see that NASM and NAAM are
also better than the RBF network.

Table 7.9. Effect of number of hidden units on HSI and DJIA

             HSI                          DJIA
Hidden No.   MAE      UMAE     DMAE       MAE     UMAE    DMAE
3            386.65   165.08   221.57     88.31   44.60   43.71
5            277.83   128.92   148.91     98.44   48.46   49.98
7            219.32   104.15   115.17     90.53   46.22   44.31
9            221.81   109.46   112.35     87.23   44.09   43.14


7.5.3 GARCH 

In this experiment, the experimental data are 3 years’ daily closing indices 
(2000-2002) from stock markets in different countries: 

Nikkei225: Nikkei225 Stock Average from Japan, the daily closing prices 
are plotted in Fig. 7.11(a); 

DJIA00-02: Dow Jones Industrial Average (DJIA) from USA, the daily 
closing prices are plotted in Fig. 7.13(a); 

FTSE100: FTSE100 index from UK, the daily closing prices are plotted 
in Fig. 7.15(a). 

In the data processing step, the daily closing prices of these indices are 
converted to continuously compounded returns and the ratio of the number 
of training data to the number of testing data is set to 5:1. Therefore, we 
obtain and list the corresponding training and testing periods in Table 7.10. 


Table 7.10. GARCH experimental data description

Indices     Training period                Testing period
Nikkei225   4 Jan., 2000 - 2 Jul., 2002    4 Jul., 2002 - 30 Dec., 2002
DJIA00-02   3 Jan., 2000 - 3 Jul., 2002    5 Jul., 2002 - 31 Dec., 2002
FTSE100     4 Jan., 2000 - 3 Jul., 2002    4 Jul., 2002 - 31 Dec., 2002
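The data processing step can be sketched in Python as follows. Continuously compounded returns are log returns, r_t = ln(P_t / P_{t-1}), and a predicted return is converted back to a price by P_t = P_{t-1} exp(r_t). The chronological 5:1 split via integer division is an assumption about how the ratio was applied, and the function names are illustrative.

```python
import math


def to_log_returns(prices):
    """Continuously compounded returns r_t = ln(P_t / P_{t-1})."""
    return [math.log(cur / prev) for prev, cur in zip(prices, prices[1:])]


def next_price(last_price, predicted_return):
    """Invert the transform: P_t = P_{t-1} * exp(r_t)."""
    return last_price * math.exp(predicted_return)


def split_train_test(values, train_parts=5, test_parts=1):
    """Chronological split with a training:testing ratio of 5:1 (assumed rule)."""
    n_train = len(values) * train_parts // (train_parts + test_parts)
    return values[:n_train], values[n_train:]


# Made-up prices, for illustration only.
prices = [100.0, 105.0, 103.0, 108.0, 110.0, 107.0, 111.0]
returns = to_log_returns(prices)
train, test = split_train_test(returns)
print(len(train), len(test))
print(next_price(prices[-2], returns[-1]))
```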


7.5.3.1 GARCH(1,1)

We use the Matlab toolbox to fit the GARCH model. Before running the SVR
algorithm, we run the GARCH(1,1) model to determine the width of the margin
in SVR. For Nikkei225, the parameter estimates and their standard errors are
listed in Table 7.11, i.e. the best fit for Nikkei225 by GARCH(1,1) is:

$y_t = 0.49468 + \varepsilon_t$ ,

$\sigma_t^2 = 0.00073917 + 0.8682\,\sigma_{t-1}^2 + 0.077218\,\varepsilon_{t-1}^2$ .


Table 7.11. GARCH parameter for Nikkei225

Parameter   Value        Standard error   T statistic
$C_0$       0.49468      0.0045008        109.9083
$K_0$       0.00073917   0.00034866       2.1200
GARCH(1)    0.8682       0.048144         18.0334
ARCH(1)     0.077218     0.027279         2.8306


We also show the log-likelihood contours of the GARCH(1,1) model fit to
the returns of dataset Nikkei225 in Fig. 7.8(a). The log-likelihood contours
are plotted in the GARCH coefficient-ARCH coefficient ($G_1$-$A_1$) plane,
holding the parameters $C_0$ and $K_0$ fixed at their maximum likelihood
estimates, 0.49468 and 0.00073917, respectively. The contours confirm the
results in Table 7.11. The maximum log-likelihood value occurs at the
coordinates $G_1$ = GARCH(1) = 0.8682 and $A_1$ = ARCH(1) = 0.077218. This
figure also reveals a highly negative correlation between the estimates of
the $G_1$ and $A_1$ parameters of the GARCH(1,1) model. It implies that a
small change in the estimate of the $G_1$ parameter is nearly compensated for
by a corresponding change of opposite sign in the $A_1$ parameter. The
innovations, standard deviations ($\sigma_t$) and returns of Nikkei225 are
shown in Fig. 7.8(b).



(a) GARCH(1,1) log-likelihood contours of Nikkei225 (axis label: GARCH coefficient)
(b) Innovations, conditional standard deviations and returns of Nikkei225

Fig. 7.8. GARCH(1,1) of Nikkei225. The color-coded bar at the right of (a)
indicates the height of the log-likelihood surface of the GARCH(1,1) plane
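The fitted GARCH(1,1) model defines a recursion for the conditional variance, which is what supplies the margin width in SVR. The Python sketch below runs that recursion with the Nikkei225 estimates from Table 7.11. The starting value (the unconditional variance K_0 / (1 - G_1 - A_1)) is an assumption, the return values are made up, and the sketch does not reproduce the estimation step performed by the Matlab toolbox.

```python
import math


def garch11_variances(returns, c0, k0, g1, a1, initial_variance=None):
    """Conditional variances from sigma_t^2 = k0 + g1*sigma_{t-1}^2 + a1*eps_{t-1}^2."""
    if initial_variance is None:
        # Assumption: start at the unconditional variance k0 / (1 - g1 - a1).
        initial_variance = k0 / (1.0 - g1 - a1)
    variance = initial_variance
    variances = []
    for y in returns:
        variances.append(variance)
        eps = y - c0                               # innovation eps_t = y_t - c0
        variance = k0 + g1 * variance + a1 * eps * eps
    return variances


# Estimates for Nikkei225 from Table 7.11.
C0, K0, G1, A1 = 0.49468, 0.00073917, 0.8682, 0.077218

# Made-up return values, for illustration only.
sample_returns = [0.52, 0.47, 0.61, 0.49, 0.38]
sigmas = [math.sqrt(v) for v in garch11_variances(sample_returns, C0, K0, G1, A1)]
print(sigmas)
```

Because G_1 + A_1 is below 1 for all three indices, the recursion has a finite unconditional variance, which is why it is a convenient (assumed) starting point.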
























For dataset DJIA00-02, the GARCH(1,1) parameter estimates are listed in
Table 7.12; therefore, the best fit for DJIA00-02 by GARCH(1,1) is

$y_t = 0.60363 + \varepsilon_t$ ,

$\sigma_t^2 = 0.00056832 + 0.85971\,\sigma_{t-1}^2 + 0.092295\,\varepsilon_{t-1}^2$ .


Table 7.12. GARCH parameter for DJIA00-02

Parameter   Value        Standard error   T statistic
$C_0$       0.60363      0.0041185        146.5631
$K_0$       0.00056832   0.00023491       2.4193
GARCH(1)    0.85971      0.031773         27.0580
ARCH(1)     0.092295     0.020352         4.5350


The corresponding log-likelihood contours of DJIA00-02 are plotted in
Fig. 7.9(a); the maximum log-likelihood value occurs at the coordinates
$G_1$ = GARCH(1) = 0.85971 and $A_1$ = ARCH(1) = 0.09229. The corresponding
innovations, standard deviations and returns of DJIA00-02 are shown in
Fig. 7.9(b).



(a) GARCH(1,1) log-likelihood contours of FTSE100
(b) Innovations, conditional standard deviations and returns of FTSE100

Fig. 7.9. GARCH(1,1) of FTSE100. The color-coded bar at the right of (a)
indicates the height of the log-likelihood surface of the GARCH(1,1) plane


For dataset FTSE100, the GARCH(1,1) parameter estimates are listed in
Table 7.13; therefore, the best fit for FTSE100 by GARCH(1,1) is

$y_t = 0.50444 + \varepsilon_t$ ,

$\sigma_t^2 = 0.0011599 + 0.82253\,\sigma_{t-1}^2 + 0.12693\,\varepsilon_{t-1}^2$ .


Table 7.13. GARCH parameter for FTSE100

Parameter   Value       Standard error   T statistic
$C_0$       0.50444     0.0053313        94.6180
$K_0$       0.0011599   0.00049206       2.3573
GARCH(1)    0.82253     0.04906          16.7658
ARCH(1)     0.12693     0.034698         3.6582


The corresponding log-likelihood contours of FTSE100 are plotted in
Fig. 7.10(a). The maximum log-likelihood value occurs at the coordinates
$G_1$ = GARCH(1) = 0.82253 and $A_1$ = ARCH(1) = 0.12693. The corresponding
innovations, standard deviations and returns of FTSE100 are shown
in Fig. 7.10(b).

(b) Innovations, conditional standard deviations and returns of DJIA00-02

Fig. 7.10. GARCH(1,1) of DJIA00-02. The color-coded bar at the right of (a)
indicates the height of the log-likelihood surface of the GARCH(1,1) plane


7.5.3.2 SVR Algorithm 

For the SVR algorithm, the experimental procedure consists of three steps.
First, we normalize the return value by
$t_i = (r_i - r_{low})/(r_{high} - r_{low})$, where $r_i$ is the actual return
of the stock at day $i$, and $r_{low}$ and $r_{high}$ are the minimum and
maximum returns in the training data, respectively. Then, we train on the
normalized training data once and obtain the normalized predicted return value
$p_{n_i} = f(x_i)$, where $x_i = (t_{i-4}, t_{i-3}, t_{i-2}, t_{i-1})$. Finally,
we unnormalize $p_{n_i}$, convert the result to a price and obtain the
corresponding predicted price $p_i$.
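The normalization and its inverse can be sketched in Python as below. The function names are hypothetical and the return values are made up. Note that the bounds come from the training data only, so a testing return outside that range maps outside [0, 1].

```python
def fit_min_max(train_returns):
    """Return (r_low, r_high), the extreme returns of the training data."""
    return min(train_returns), max(train_returns)


def normalize(r, r_low, r_high):
    """t_i = (r_i - r_low) / (r_high - r_low)."""
    return (r - r_low) / (r_high - r_low)


def unnormalize(t, r_low, r_high):
    """Inverse map: r = t * (r_high - r_low) + r_low."""
    return t * (r_high - r_low) + r_low


# Made-up returns, for illustration only.
train_returns = [-0.02, 0.01, 0.03, -0.01]
r_low, r_high = fit_min_max(train_returns)
normalized = [normalize(r, r_low, r_high) for r in train_returns]
restored = [unnormalize(t, r_low, r_high) for t in normalized]
print(normalized)
print(restored)
```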

Before running the SVR algorithm, we have to choose two parameters: $C$,
the cost of error, and $\beta$, the parameter of the kernel function. Here we
choose the same parameters for the different indices. They are listed in
Table 7.14.


Table 7.14. Parameters in GARCH experiments for NASM

Indices     C    $\beta$
Nikkei225   2    $2^{-4}$
DJIA        2    $2^{-4}$
FTSE100     2    $2^{-4}$


Here, we consider only the case of NASM. The margins are set according to
Eq.(7.13). Concretely, we set the margin width to the $\sigma$ calculated by
GARCH(1,1) from the return series $y$; therefore $\lambda_1 = \lambda_2 = 1/2$
and $\mu = 0$. For the fixed margin cases, we set the margin width to 0.1, i.e.
$u(x) + d(x) = 0.1$, with increments of 0.02. The corresponding results are
shown in Tables 7.15-7.17. We also plot the training and testing data results
of NASM in Figs. 7.12(a) and 7.12(b) for index Nikkei225, in Figs. 7.14(a) and
7.14(b) for index DJIA00-02, and in Figs. 7.16(a) and 7.16(b) for index
FTSE100, respectively. From these results, we can see that for the FTSE100
index, NASM predicts better than the fixed margin cases. For Nikkei225, when
$u(x) = 0.06$, $d(x) = 0.04$ and when $u(x) = 0.08$, $d(x) = 0.02$, the
predicted results are better than NASM. For DJIA00-02, when $u(x) = 0.06$,
$d(x) = 0.04$, the predicted result is slightly better than NASM.

7.5.3.3 AR Models 

We also use AR models with different orders (1-6) to predict the prices of the
above three indices. The experimental procedure is to fit the AR model on the
training return series and to obtain the predicted return values on the testing
data. We then convert the predicted return values to price values. The
experimental results are shown in Table 7.18. Comparing the results in
Tables 7.15 and 7.17 with the results in columns 2-4 and 8-10 of Table 7.18,
we can see that for the Nikkei225 and FTSE100 indices, the NASM method is
better than the AR model. For DJIA00-02, the NASM method is slightly worse
than AR(1), but better than the AR models of other orders.

For index Nikkei225, the comparison of predictive error and risks is shown
in Fig. 7.11(b), where the corresponding bar values are from Table 7.15 and
columns 2-4 of Table 7.18. The predictive error and risks of DJIA00-02 are
shown in Fig. 7.13(b), where the corresponding bar values are from Table 7.16
and columns 5-7 of Table 7.18. The predictive error and risks of FTSE100 are
shown in Fig. 7.15(b), where the corresponding bar values are from Table 7.17
and columns 8-10 of Table 7.18.


Table 7.15. SVR results for Nikkei225

Type   u(x)       d(x)       MAE      UMAE    DMAE
NASM   $\sigma$   $\sigma$   124.37   55.97   68.40
FAAM   0          0.10       141.60   30.70   110.90
FAAM   0.02       0.08       131.25   39.02   92.23
FAAM   0.04       0.06       125.63   49.66   75.97
FAAM   0.06       0.04       123.11   61.81   61.30
FAAM   0.08       0.02       124.00   75.63   48.37
FAAM   0.10       0          129.19   91.56   37.63


Table 7.16. SVR results for DJIA00-02

Type   u(x)       d(x)       MAE      UMAE    DMAE
NASM   $\sigma$   $\sigma$   129.56   62.74   66.83
FAAM   0          0.10       139.82   41.56   98.26
FAAM   0.02       0.08       134.33   49.16   85.17
FAAM   0.04       0.06       130.49   57.56   72.93
FAAM   0.06       0.04       128.51   66.87   61.64
FAAM   0.08       0.02       129.65   77.72   51.94
FAAM   0.10       0          133.76   90.02   43.74


Table 7.17. SVR results for FTSE100

Type   u(x)       d(x)       MAE     UMAE    DMAE
NASM   $\sigma$   $\sigma$   69.61   33.42   36.19
FAAM   0          0.10       73.46   25.93   47.53
FAAM   0.02       0.08       71.98   28.52   43.46
FAAM   0.04       0.06       70.83   31.27   39.56
FAAM   0.06       0.04       70.10   34.22   35.88
FAAM   0.08       0.02       69.86   37.42   32.45
FAAM   0.10       0          70.26   40.92   29.34


Table 7.18. AR results

        Nikkei225                DJIA00-02                FTSE100
Order   MAE      UMAE    DMAE    MAE      UMAE    DMAE    MAE     UMAE    DMAE
1       125.31   53.40   71.91   128.58   61.67   66.91   71.44   33.9    37.53
2       125.68   53.31   72.36   130.00   62.08   67.92   71.40   33.46   37.94
3       125.67   53.37   72.30   130.56   62.50   68.06   70.41   32.76   37.65
4       125.22   52.91   72.31   131.20   62.93   68.27   69.96   32.76   37.20
5       125.32   53.08   72.24   131.27   62.90   68.38   70.12   32.89   37.23
6       125.40   52.72   72.68   131.32   62.89   68.43   69.99   32.78   37.21


Fig. 7.11. Nikkei225 data plot and experimental results graphs. (a) Data plot;
(b) Results comparison (MAE, UMAE, DMAE)

Fig. 7.12. Experimental results graphs using GARCH method for Nikkei225

Fig. 7.13. DJIA00-02 data plot and experimental results graphs. (a) Data plot;
(b) Results comparison (MAE, UMAE, DMAE)

Fig. 7.14. Experimental results graphs using GARCH method for DJIA00-02

Fig. 7.15. FTSE100 data plot and experimental results graphs. (a) Data plot;
(b) Results comparison

Fig. 7.16. Experimental results graphs using GARCH method for FTSE100.
(a) Training results; (b) Testing results


7.6 Discussions

Having described the experiments and their results, we find that NASM is
generally superior to FASM and FAAM. One reason is that NASM captures
stock market information and incorporates it into the setting of the
margin. This provides helpful information for the prediction. Another reason
is that with NASM the margin width is determined by a meaningful
value, which changes with the stock market. This method is therefore
more flexible than the fixed margin cases, and it partially avoids the risk
of poor predictive results that arises when the margin values are chosen
arbitrarily in the fixed margin cases.

Furthermore, we know that NAAM may be better than NASM. For 
example, by adding a momentum, we may not only improve the accuracy 
of prediction, but also reduce the predictive downside risk. 

Another observation, from Figs. 7.6(a) and 7.7(a), is that with carefully
selected parameters the SVR algorithm has predictive performance similar to
that of the other models. However, SVR libraries are easy for a novice to run.
Since every local optimum of the SVR training problem is a global optimum,
the user is guaranteed to find an optimal solution easily and stably. This
advantage is very useful for a novice learning a new model or library, and it
builds confidence in comparison with learning other non-linear models,
e.g. RBF networks.

In general, our methods can be considered as a form of model selection that
determines the parameter $\varepsilon$. We do not consider the setting of other
parameters, such as $C$ and $\beta$; we simply use cross-validation to find
suitable values for them. However, this procedure is time-consuming. We may
add some market information to set these parameters, e.g. [4]. In addition,
the margin width set by the GARCH model is too wide; we may need to add more
useful terms to shrink it. This can be one direction of our future work. A
valuable lesson is that the normalization procedure helps in selecting
suitable parameters easily and stably.

Finally, we turn to a key weakness of our model: the predictive model
does not lead directly to profit making in real life, and we do not provide
confidence estimates for these predictive models. However, using our model to
predict stock market prices may still yield useful information; the
predictive results may provide some helpful suggestions.


Conclusion and Future Work


In this chapter, we summarize the book. We review the whole journey, which
starts from two schools of learning thought in the machine learning
literature, and then motivate the resulting combined learning approach,
including the Maxi-Min Margin Machine, the Minimum Error Minimax Probability
Machine and their extensions. We then present future perspectives both within
the proposed models and beyond the developed approaches.


8.1 Review of the Journey 

Two paradigms exist in the literature of machine learning. One is the
school of global learning approaches; the other is the school of local learning
approaches. Global learning enjoys a long and distinguished history and
usually focuses on describing phenomena by estimating a distribution from
data. Based on the estimated distribution, global learning methods can
then perform inferences, conduct marginalizations, and make predictions.
Although they have many good features, e.g. a relatively simple optimization
and the flexibility to incorporate global information such as structure
information and invariance, these learning approaches have to assume a
specific type of distribution a priori. In general, however, the assumption
itself may be invalid. On the other hand, local learning methods do not
estimate a distribution from data. Instead, they focus on extracting only the
local information which is directly related to the learning task, i.e. the
classification in this book. Recent progress following this trend has
demonstrated that local learning approaches, e.g. the Support Vector Machine
(SVM), outperform global learning methods in many aspects. Despite this
success, local learning discards plenty of important global information on
data, e.g. the structure information, and this restricts the performance of
this type of learning scheme. Motivated by the investigation of these two
types of learning approaches, we therefore propose a hybrid learning
framework: we should learn from data both globally and locally.

Following the hybrid learning thought, we develop a hybrid model
named the Maxi-Min Margin Machine ($M^4$), which successfully combines two
largely different but complementary paradigms. This new model is demonstrated
to contain appealing features of both global learning and local learning.
It can capture the global structure information from data, while it
also provides a task-oriented scheme for the learning purpose and inherits the
superior performance of local learning. This model is theoretically
important in the sense that $M^4$ contains many important learning models as
special cases, including Support Vector Machines, the Minimax Probability
Machine (MPM), and Fisher Discriminant Analysis; the proposed model is also
empirically promising in that it can be cast as a Sequential Second Order
Cone Programming problem yielding a polynomial time complexity.

The idea of learning from data locally and globally is also applicable to
regression tasks. Directly motivated by the Maxi-Min Margin Machine, a
new regression model named Local Support Vector Regression (LSVR) is
proposed in this book. LSVR is demonstrated to provide a systematic and
automatic scheme to locally and flexibly adapt the margin, which is globally
fixed in the standard Support Vector Regression (SVR), a state-of-the-art
regression model. Therefore, it can tolerate noise adaptively. The proposed
LSVR is promising in the sense that it not only captures the local
information of the data in approximating functions but, more importantly,
includes special cases which enjoy a physical meaning very similar to that of
the standard SVR. Both theoretical and empirical investigations demonstrate
the advantages of this new model.

Besides the above two important models, another important contribution
of this book is a novel global learning model called the
Minimum Error Minimax Probability Machine (MEMPM). Although still
within the framework of global learning, this model does not need to assume
any specific distribution beforehand and represents a distribution-free Bayes
optimal classifier in a worst-case scenario. This distinguishes the model
from the traditional global learning models, especially the traditional
Bayes optimal classifier. One promising feature of MEMPM is that it can
derive an explicit accuracy bound under a mild condition, leading to good
generalization performance on future data.

The fourth contribution of this book is the Biased Minimax Probability
Machine (BMPM) model. Even though it is a special case of MEMPM, we highlight
this model because BMPM provides the first systematic and rigorous approach
for an important kind of learning task, namely biased learning or imbalanced
learning. Different from traditional imbalanced (biased) learning methods,
BMPM can quantitatively and explicitly incorporate a bias for one class and
consequently emphasize the more important class. A series of experiments
demonstrate that BMPM is very promising in imbalanced learning and medical
diagnosis.


8.2 Future Work 

The models developed in this book bridge the gap between local learning and
global learning. This brings a new viewpoint for both existing local models
and global models. Following the viewpoint of learning from data both
globally and locally, there seem to be many immediate directions both inside
and beyond the models proposed in this book.

8.2.1 Inside the Proposed Models 

There is certainly much work to be done to improve the models proposed in
this book.

First, all the models proposed in this book, including the Minimum Error
Minimax Probability Machine, the Maxi-Min Margin Machine and Local Support
Vector Regression, involve solving either a single Second Order Cone
Programming problem or a Sequential Second Order Cone Programming problem.
Although many optimization programs have demonstrated good performance and
mathematical tractability in solving this kind of problem, they
are designed for general purposes and may not adequately exploit the
specific properties of our models. Therefore, it is highly possible and
valuable to develop special optimization algorithms to speed up their
training. In particular, the Maxi-Min Margin Machine enjoys the feature of
sparsity. By taking advantage of this property, researchers have developed
fast optimization algorithms for the Support Vector Machine. It is therefore
very interesting to investigate whether similar procedures can be applied
here. This interesting topic deserves much attention and remains an open
problem.

Second, an immediate problem for the Minimum Error Minimax Probability
Machine is the possible presence of local optima in the practical
optimization procedures. While empirical evidence shows that the global
optimum can be attained in most cases, a local optimum may occur when the two
types of data are not well separated. Conventional simulated annealing [6, 14]
or deterministic annealing methods [11, 12] are certainly possible ways to
attack this problem; however, a formal approach, either a regularization
augmentation or an algorithmic approximation, may prove more appropriate.

Third, as shown in this book, all the proposed models apply the
kernelization trick to extend their applications to nonlinear tasks. However,
it is well known that some global information, e.g. the structure information,
may not be well preserved when the data are mapped from the original space to
the feature space. This may restrict the power of learning from data both
globally and locally. Motivated by this view, it is highly valuable to
develop techniques to retain the global information of data when performing
the projection from the original space to the feature space. This can also
be considered as the task of choosing a suitable kernel, which currently
attracts much interest in the machine learning community [4, 15].

Another important future direction for the proposed classification models,
i.e. the Minimum Error Minimax Probability Machine and the Maxi-Min Margin
Machine, is how to extend the current binary classification to multi-way
classification. Although the one vs. all and one vs. one [1, 16] approaches
are the main tools for this extension, a more systematic and more rigorous
approach is always preferable.

8.2.2 Beyond the Proposed Models 

Although several important models have been motivated and developed from
the viewpoint of learning from data both globally and locally, beyond these
models there is plenty of work deserving future investigation.

One natural question is whether other well-known local or global models
can be extended by adopting the viewpoint of learning from data globally
and locally. For example, Neural Networks, a large family of popular learning
models, might also be considered as modelling data in a local fashion. It is
therefore very interesting to investigate whether global information can also
be incorporated into these kinds of learning processes.

The learning discussed in this book is restricted to the framework of
either classification or regression tasks. Both tasks belong to so-called
supervised learning [5, 9, 18]. However, the other, largely different
learning paradigm, unsupervised learning [10, 13, 17], and the recently
emerging semi-supervised learning [2, 3, 8, 7] are not considered. Therefore,
exploring possible applications of hybrid learning in these fields presents
a straightforward and immediate topic for ongoing work.